Use Paramspace to automate file naming scheme based on wildcards #36

Open
5 tasks done
kelly-sovacool opened this issue Jan 18, 2023 · 2 comments · May be fixed by #40
Labels
feature A new feature request or enhancement

Comments

@kelly-sovacool
Member

kelly-sovacool commented Jan 18, 2023

Currently, wildcards are hard-coded in the input/output filenames. However, users may want to repeat model training with different parameters (e.g. different outcomes, or the same dataset at different taxonomic levels). Using Paramspace would make the rule definitions more general instead of hard-coded. See the main snakemake docs and the snakemake.utils API docs for how Paramspace() works.

TODO

  • Write helper functions for Paramspace:
    • Build paramspace from config.
    • Get wildcard pattern with certain wildcards escaped with double braces.
    • Get instance pattern without certain wildcards included.
  • Fix hyperparameter performance combine + plot.
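A rough sketch of what the two pattern helpers in the TODO list could look like, using plain string operations only. The function names `escape_wildcards` and `drop_params` are hypothetical and not part of snakemake; the sketch also assumes the `name<sep>value` path layout that Paramspace produces when `param_sep` is set.

```python
from typing import Iterable


def escape_wildcards(pattern: str, names: Iterable[str]) -> str:
    """Double the braces around the given wildcards so that a later
    str.format() call leaves them behind as single-brace wildcards."""
    for name in names:
        pattern = pattern.replace("{" + name + "}", "{{" + name + "}}")
    return pattern


def drop_params(instance: str, names: Iterable[str], param_sep: str = "_") -> str:
    """Remove the path components that belong to the given parameters
    from an instance pattern such as 'ml_methods_rf/seed_100/kfold_5'.

    Caveat: matching is by prefix, so a parameter whose name is a prefix
    of another parameter's name (plus the separator) would also match.
    """
    prefixes = tuple(name + param_sep for name in names)
    parts = [p for p in instance.split("/") if not p.startswith(prefixes)]
    return "/".join(parts)


pattern = "ml_methods_{ml_methods}/seed_{seed}/kfold_{kfold}"
escaped = escape_wildcards(pattern, ["seed"])
print(escaped)
# Formatting fills the unescaped wildcards and keeps {seed} as a wildcard.
print(escaped.format(ml_methods="rf", kfold=5))
print(drop_params("ml_methods_rf/seed_100/kfold_5", ["seed"]))
```

Working on the pattern strings keeps the helpers independent of how the Paramspace itself was built, at the cost of relying on the path layout staying the same.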
@kelly-sovacool
Member Author

A quick proof of concept:

import pandas as pd
import yaml
from snakemake.utils import Paramspace

# Load the workflow config.
with open('config/robust.yml', 'r') as infile:
    config = yaml.safe_load(infile)

# Drop config keys that should not become wildcards.
ignore_keys = ['dataset-csv', 'outcome-colname', 'hyperparams', 'find-feature-importance', 'nseeds', 'ncores']
for k in ignore_keys:
    config.pop(k, None)  # tolerate keys that are missing from the config
# One column per remaining key; scalar values are repeated for every row.
config_df = pd.DataFrame.from_dict(config)

# Build the parameter space, joining names and values with "_".
paramspace = Paramspace(config_df, param_sep="_")
print('paramspace.wildcard_pattern:\t', paramspace.wildcard_pattern)
print('paramspace.instance_patterns:\t', list(paramspace.instance_patterns))

output:

paramspace.wildcard_pattern:     dataset-name_{dataset-name}/ml-methods_{ml-methods}/kfold_{kfold}
paramspace.instance_patterns:    ['_0_otu-large/_1_glmnet/kfold_5', '_0_otu-large/_1_rf/kfold_5', '_0_otu-large/_1_rpart2/kfold_5', '_0_otu-large/_1_svmRadial/kfold_5']

Paramspace class doc
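A possible explanation for the `_0_` and `_1_` prefixes in the instance patterns above: hyphenated column names such as `dataset-name` are not valid Python identifiers. If the rows are iterated as named tuples (which is what pandas' `itertuples()` does, and is assumed here to be what Paramspace uses internally), invalid field names get replaced by positional names. The sketch below reproduces the effect with the standard library only; the `Row` type and the way the instance string is joined are illustrative assumptions, not the snakemake implementation.

```python
from collections import namedtuple

# Column names as in the hyphenated config keys.
columns = ["dataset-name", "ml-methods", "kfold"]

# rename=True replaces invalid identifiers with positional names (_0, _1, ...).
Row = namedtuple("Row", columns, rename=True)
print(Row._fields)

# Joining name and value with "_" reproduces the pattern seen in the output.
row = Row("otu-large", "glmnet", 5)
instance = "/".join(f"{name}_{value}" for name, value in row._asdict().items())
print(instance)
```

If that is indeed the cause, using underscores instead of hyphens in the config keys avoids the renaming.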

@kelly-sovacool kelly-sovacool added the feature A new feature request or enhancement label Jan 18, 2023
@kelly-sovacool
Member Author

Now with all combinations of the list-valued config entries (their Cartesian product), similar to R's expand.grid().

from itertools import product

import pandas as pd
import yaml
from snakemake.utils import Paramspace

# Load the workflow config.
with open('config/robust.yml', 'r') as infile:
    config = yaml.safe_load(infile)

# Drop config keys that should not become wildcards.
ignore_keys = ['dataset_csv', 'outcome_colname', 'hyperparams', 'find_feature_importance', 'ncores', 'nseeds']
for k in ignore_keys:
    config.pop(k, None)

# Add the seeds as one more list-valued parameter.
config['seed'] = list(range(100, 102))

# One row per combination of the list-valued entries.
conf_lists = {k: v for k, v in config.items() if isinstance(v, list)}
params_df = pd.DataFrame(list(product(*conf_lists.values())), columns=list(conf_lists.keys()))

# The remaining scalar entries become constant columns.
for k in conf_lists:
    config.pop(k)
for k, v in config.items():
    params_df[k] = v

paramspace = Paramspace(params_df, param_sep="_")
print('paramspace.wildcard_pattern:\t', paramspace.wildcard_pattern)
print('paramspace.instance_patterns:\t', list(paramspace.instance_patterns))

output:

paramspace.wildcard_pattern:     ml_methods_{ml_methods}/seed_{seed}/dataset_name_{dataset_name}/kfold_{kfold}
paramspace.instance_patterns:    ['ml_methods_glmnet/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_glmnet/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_rf/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_rf/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_rpart2/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_rpart2/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_svmRadial/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_svmRadial/seed_101/dataset_name_otu_large/kfold_5']
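For the "Build paramspace from config" item in the TODO list, the expansion step above could be wrapped in a small function. This is only a sketch with the standard library: `expand_config` is a hypothetical name, the example config values are made up to mirror the output above, and the resulting rows would still need to be passed to `pd.DataFrame(rows)` and then `Paramspace(...)`.

```python
from itertools import product


def expand_config(config, ignore_keys=()):
    """Return one dict per combination of the list-valued config entries.

    Scalar entries are copied unchanged into every row; keys listed in
    ignore_keys are left out entirely.
    """
    params = {k: v for k, v in config.items() if k not in ignore_keys}
    lists = {k: v for k, v in params.items() if isinstance(v, list)}
    scalars = {k: v for k, v in params.items() if not isinstance(v, list)}

    rows = []
    for combo in product(*lists.values()):
        row = dict(zip(lists, combo))  # list-valued parameters first
        row.update(scalars)            # then the constant ones
        rows.append(row)
    return rows


# Example config (illustrative values only).
config = {
    "dataset_name": "otu_large",
    "ml_methods": ["glmnet", "rf"],
    "kfold": 5,
    "ncores": 4,
    "seed": [100, 101],
}
rows = expand_config(config, ignore_keys=["ncores"])
for row in rows:
    print(row)
```

Returning plain dicts keeps the function easy to test without snakemake or pandas installed; the column order (list-valued keys first, then scalars) matches the wildcard pattern shown above.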
