Nested studies for Nested Cross-Validation Support #5321
Thank you for the feature request. Do you think #5118 is the same as this request?
@nzw0301 Thank you for your reply. #5118 seems to have a confusing objective, and the answers there appear to solve the issue using features already implemented in Optuna. The feature proposed here might also help with #5118, but I would consider this an independent issue.

See below a diagram of the proposed structure. The role of the parent wrapper is to allow comparison between different runs of nested cross-validation while keeping the study database table organized and the optuna-dashboard tidy. Maybe we can adopt the term "Experiment" for this new class?

[Diagram: an "Experiment" object wrapping the Studies created for each outer split]

The results of individual Studies are irrelevant when assessing the results of nested cross-validation, as we are interested in the overall performance across all Studies (e.g. the mean accuracy of the best model of each Study). The "Experiment" class could implement a method to store this overall metric (`set_metric()` in the example below).

I think users should not be forced to create an Experiment to be able to create Studies, as this would break current implementations with Optuna, and many users won't need Experiments at all. Instead, an Experiment object could be created and passed when creating Study objects. The Study database table could then store a foreign key to the parent Experiment.

See below an example usage:
import optuna
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

'''
Suppose that `X` and `y` are the input features and
input labels, respectively.
Suppose that `pipeline` is an sklearn Pipeline.
'''

search_spaces = {
    'param_1': optuna.distributions.FloatDistribution(0.001, 0.4, log=True),
    'param_2': optuna.distributions.IntDistribution(10, 500),
}

storage = 'sqlite:///example-storage.db'

# Here, we create the new Experiment object
experiment = Experiment(
    experiment_name='nestedCV-1',
    storage=storage,
)

outer_scores = []
cv_outer = KFold()
for split, (train_idx, test_idx) in enumerate(cv_outer.split(X, y)):
    study = optuna.create_study(
        study_name=f'study_{split}',
        storage=storage,
        direction='maximize',
        experiment=experiment,  # We assign this study to the experiment object we defined
    )
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    def objective(trial):
        trial_params = {key: trial._suggest(key, dist) for key, dist in search_spaces.items()}
        pipeline.set_params(**trial_params)
        cv_inner = KFold()
        scores = cross_val_score(pipeline, X_train, y_train, cv=cv_inner)
        return scores.mean()

    study.optimize(objective, n_trials=100)

    best_params = study.best_params
    pipeline.set_params(**best_params)
    pipeline.fit(X_train, y_train)
    test_score = pipeline.score(X_test, y_test)
    outer_scores.append(test_score)

# We add the overall performance metric (similar to `set_user_attr()`)
experiment.set_metric('Mean Accuracy', np.mean(outer_scores))
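For reference, the closest mechanism that exists in Optuna today is `Study.set_user_attr()`, which attaches a key/value pair to a single study; the proposed `experiment.set_metric()` would behave the same way one level up. A minimal sketch using only current APIs (the study name and the placeholder value are illustrative):

import optuna

study = optuna.create_study(
    study_name='study_0',
    storage='sqlite:///example-storage.db',
    direction='maximize',
    load_if_exists=True,
)
# Attach an arbitrary key/value pair to this single study.
study.set_user_attr('Mean Accuracy', 0.93)
print(study.user_attrs)  # -> {'Mean Accuracy': 0.93}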
Motivation
When training a model with a nested cross-validation approach, a separate study is created for each split of the outer cross-validation. However, it would be useful to organize all of these studies under a "Parent Study" showing the overall result of the experiment (e.g. the mean accuracy across all outer splits).
Description
This "Parent Study" could work as a wrapper of all children studies, showing averaged metrics from all the children studies and giving an overview of the Nested Cross-Validation results. Therefore, The Parent Study may be implemented using a separate wrapper class containing a list of children and user-defined averaged results.
In the optuna-dashboard, the Parent Study could be displayed as a folder or as a dropdown in the navbar.
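To make the shape of the proposal concrete, here is a rough in-memory sketch of such a wrapper class. The `Experiment` name and all of its methods are hypothetical, not existing Optuna API, and the real proposal would persist the parent/child relation via a foreign key in the study table rather than a Python list:

import optuna

class Experiment:
    # Hypothetical wrapper: groups the children studies of one
    # nested-CV run and stores user-defined aggregated results.
    def __init__(self, experiment_name, storage):
        self.experiment_name = experiment_name
        self.storage = storage
        self.studies = []   # children studies (optuna.Study objects)
        self.metrics = {}   # user-defined averaged results, e.g. mean accuracy

    def add_study(self, study):
        # Register an existing study as a child of this experiment.
        self.studies.append(study)

    def set_metric(self, name, value):
        # Overall result of the whole experiment (all outer splits together).
        self.metrics[name] = value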
Alternatives (optional)
The current straightforward approach is to collect all outer-split results into a single study and record the results as user attributes (sketched below). This is suboptimal: all the splits are shown together in the training and hyperparameter graphs, which hides the fact that every outer split is completely independent and makes the graphs unusable.
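For concreteness, a minimal sketch of this workaround under the same assumptions as the example above (`X`, `y`, and `pipeline` defined elsewhere). It tags each trial with its outer split via `Trial.set_user_attr()`, which exists today, but all trials still land in one study:

import optuna
from sklearn.model_selection import KFold, cross_val_score

# All outer splits share ONE study, which is what clutters the plots.
study = optuna.create_study(direction='maximize')

for split, (train_idx, test_idx) in enumerate(KFold().split(X, y)):
    def objective(trial):
        # Tag each trial with its outer split so trials can at least be filtered later.
        trial.set_user_attr('outer_split', split)
        pipeline.set_params(
            param_1=trial.suggest_float('param_1', 0.001, 0.4, log=True),
            param_2=trial.suggest_int('param_2', 10, 500),
        )
        return cross_val_score(pipeline, X[train_idx], y[train_idx], cv=KFold()).mean()

    study.optimize(objective, n_trials=100)

# The plots of `study` now mix trials from completely independent splits.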
Another solution is to write scripts that edit the Optuna study database directly, which forces the user to abandon the Optuna framework.
Additional context (optional)
No response