Use Paramspace to automate file naming scheme based on wildcards #36

kelly-sovacool · 2023-01-18T19:01:27Z

Currently, wildcards are hard-coded in the I/O filenames. However, users may want to repeat model training with different parameters (e.g. different outcomes, the same dataset at different taxonomic levels, etc.). Using Paramspace would let the rule definitions be generalized instead of hard-coded. See the main Snakemake docs and the snakemake.utils API docs for how Paramspace() works.

TODO

- Write helper functions for Paramspace:
  - Build paramspace from config.
  - Get wildcard pattern with certain wildcards escaped with double braces.
  - Get instance pattern without certain wildcards included.
- Fix hyperparameter performance combine + plot.
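
The two pattern helpers in the TODO could look roughly like the sketch below. It works on plain strings only and is not part of Snakemake; the names `escape_wildcards` and `drop_from_instance` are hypothetical, and it assumes patterns of the form `name_{name}/...` with `/` between parameters and no wildcard constraints.

```python
import re

# Matches a single-brace wildcard such as {seed}, but not an already escaped {{seed}}.
_WILDCARD = re.compile(r"(?<!\{)\{([^{}]+)\}(?!\})")


def escape_wildcards(pattern, names):
    """Double the braces around the given wildcards so str.format() leaves them as {name}."""
    names = set(names)

    def _escape(match):
        name = match.group(1)
        return "{{" + name + "}}" if name in names else match.group(0)

    return _WILDCARD.sub(_escape, pattern)


def drop_from_instance(instance, names, param_sep="_"):
    """Remove the path segments of an instance pattern that belong to the given parameters."""
    prefixes = tuple(f"{name}{param_sep}" for name in names)
    kept = [seg for seg in instance.split("/") if not seg.startswith(prefixes)]
    return "/".join(kept)


pattern = "ml_methods_{ml_methods}/seed_{seed}/dataset_name_{dataset_name}/kfold_{kfold}"

# Keep {seed} as a wildcard while filling in everything else.
escaped = escape_wildcards(pattern, ["seed"])
partially_filled = escaped.format(ml_methods="rf", dataset_name="otu_large", kfold=5)
print(partially_filled)  # ml_methods_rf/seed_{seed}/dataset_name_otu_large/kfold_5

# Drop the seed segment from a concrete instance, e.g. for outputs that aggregate over seeds.
without_seed = drop_from_instance(
    "ml_methods_rf/seed_100/dataset_name_otu_large/kfold_5", ["seed"]
)
print(without_seed)  # ml_methods_rf/dataset_name_otu_large/kfold_5
```

Matching on the `name` + separator prefix rather than splitting on `_` is deliberate here, since parameter names such as `ml_methods` contain the separator themselves.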


kelly-sovacool · 2023-01-18T22:10:59Z

A quick proof of concept:

import pandas as pd
import yaml
from snakemake.utils import Paramspace

# Load the workflow config; safe_load is enough for a plain YAML config.
with open('config/robust.yml', 'r') as infile:
    config = yaml.safe_load(infile)

# Drop keys that should not become wildcards.
ignore_keys = ['dataset-csv', 'outcome-colname', 'hyperparams', 'find-feature-importance', 'nseeds', 'ncores']
for k in ignore_keys:
    config.pop(k, None)  # the default avoids a KeyError if a key is absent
# One column per remaining key; scalar values are repeated for each list entry.
config_df = pd.DataFrame.from_dict(config)

paramspace = Paramspace(config_df, param_sep="_")
print('paramspace.wildcard_pattern:\t', paramspace.wildcard_pattern)
print('paramspace.instance_patterns:\t', list(paramspace.instance_patterns))

output:

paramspace.wildcard_pattern:     dataset-name_{dataset-name}/ml-methods_{ml-methods}/kfold_{kfold}
paramspace.instance_patterns:    ['_0_otu-large/_1_glmnet/kfold_5', '_0_otu-large/_1_rf/kfold_5', '_0_otu-large/_1_rpart2/kfold_5', '_0_otu-large/_1_svmRadial/kfold_5']
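
The `_0_` and `_1_` prefixes in the instance patterns above are consistent with how Python namedtuples rename fields that are not valid identifiers. The sketch below reproduces the effect with the standard library only. It assumes, without checking the Snakemake source, that Paramspace iterates over rows as namedtuples the way pandas' `itertuples()` does; the `Row` type is purely illustrative.

```python
from collections import namedtuple

# Hyphenated names are not valid Python identifiers, so rename=True
# replaces them with positional names (_0, _1, ...); 'kfold' is kept.
Row = namedtuple("Row", ["dataset-name", "ml-methods", "kfold"], rename=True)
print(Row._fields)  # ('_0', '_1', 'kfold')

# Joining name and value with "_" gives the same shape as the output above.
row = Row("otu-large", "glmnet", 5)
instance = "/".join(f"{name}_{value}" for name, value in row._asdict().items())
print(instance)  # _0_otu-large/_1_glmnet/kfold_5
```

If that assumption holds, using underscores instead of hyphens in the config keys would avoid the renaming.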

Paramspace class doc

kelly-sovacool · 2023-01-20T22:09:11Z

Now with all combinations of the list-valued config entries, similar to R's expand.grid().

from itertools import product
import pandas as pd
from snakemake.utils import Paramspace
import yaml

# Load the workflow config; safe_load is enough for a plain YAML config.
with open('config/robust.yml', 'r') as infile:
    config = yaml.safe_load(infile)

# Drop keys that should not become wildcards.
ignore_keys = ['dataset_csv', 'outcome_colname', 'hyperparams', 'find_feature_importance', 'ncores', 'nseeds']
for k in ignore_keys:
    config.pop(k, None)

config['seed'] = list(range(100, 102))
# Cross all list-valued entries: one row per combination.
conf_lists = {k: v for k, v in config.items() if isinstance(v, list)}
params_df = pd.DataFrame(list(product(*conf_lists.values())), columns=list(conf_lists.keys()))
# Scalar entries are constant across rows, so add each as its own column.
for k, v in config.items():
    if k not in conf_lists:
        params_df[k] = v

paramspace = Paramspace(params_df, param_sep="_")
print('paramspace.wildcard_pattern:\t', paramspace.wildcard_pattern)
print('paramspace.instance_patterns:\t', list(paramspace.instance_patterns))

output:

paramspace.wildcard_pattern:     ml_methods_{ml_methods}/seed_{seed}/dataset_name_{dataset_name}/kfold_{kfold}
paramspace.instance_patterns:    ['ml_methods_glmnet/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_glmnet/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_rf/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_rf/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_rpart2/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_rpart2/seed_101/dataset_name_otu_large/kfold_5', 'ml_methods_svmRadial/seed_100/dataset_name_otu_large/kfold_5', 'ml_methods_svmRadial/seed_101/dataset_name_otu_large/kfold_5']
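
The grid-building step above could be wrapped into a reusable helper for the "build paramspace from config" item. This is a minimal sketch using only the standard library; the function name `expand_grid` is hypothetical, and the example config only mirrors the values visible in the output above (the `ncores` value is made up).

```python
from itertools import product


def expand_grid(config, ignore_keys=()):
    """Return one dict per combination of the list-valued config entries.

    Scalar entries are copied into every record; ignored keys are left out.
    """
    params = {k: v for k, v in config.items() if k not in ignore_keys}
    list_params = {k: v for k, v in params.items() if isinstance(v, list)}
    scalar_params = {k: v for k, v in params.items() if k not in list_params}

    records = []
    for combo in product(*list_params.values()):
        record = dict(zip(list_params.keys(), combo))  # list-valued keys first
        record.update(scalar_params)                   # then the constant ones
        records.append(record)
    return records


# Example config; 'ncores' is a placeholder value, not taken from the real config.
config = {
    "dataset_name": "otu_large",
    "ml_methods": ["glmnet", "rf", "rpart2", "svmRadial"],
    "kfold": 5,
    "ncores": 4,
    "seed": [100, 101],
}
grid = expand_grid(config, ignore_keys=["ncores"])
print(len(grid))  # 8
print(grid[0])    # {'ml_methods': 'glmnet', 'seed': 100, 'dataset_name': 'otu_large', 'kfold': 5}
```

Passing the records to `pd.DataFrame(grid)` should give a data frame with the same column order as `params_df` above, which could then be handed to `Paramspace(..., param_sep="_")`.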

kelly-sovacool added a commit that referenced this issue Jan 18, 2023

Quick proof of concept for paramspace (#36)

754329d

kelly-sovacool added the feature A new feature request or enhancement label Jan 18, 2023

kelly-sovacool linked a pull request Jan 28, 2023 that will close this issue

Use Paramspace to automate the file naming scheme based on wildcards #40

Open

3 tasks
