Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement to multivariate TPESampler when the union search space is different from the intersection search space #5299

Open
larsbratholm opened this issue Mar 7, 2024 · 0 comments
Assignees
Labels
feature Change that does not break compatibility, but affects the public interfaces.

Comments

@larsbratholm
Copy link

Motivation

During hyper-parameter optimization I often want to find the best of a few different methods, all of which can have a different set of parameters. To model this with the multivariate TPESampler we're required to use the group decomposition, which samples each group independently of each other. E.g. in a simple objective like the following:

def objective(trial):
    x = trial.suggest_categorical("x", ["A", "B"])
    t = trial.suggest_float("t", -2, 2)
    if x == "A":
        y = trial.suggest_float("y", 1, 2)
        return (y - t)**2
    else:
        z = trial.suggest_float("z", -2, 1)
        return (z - t)**2

the model will struggle to converge as t is heavily correlated with y and z, but they are modeled independently of each other. This is a very simple example, but highlights some issues with correlation between groups.

Description

The cleanest way I could think off to avoid the group decomposition is to replace the intersection search space used in the multivariate TPESampler with a union search space. The internal representation of 'missing' parameters can be set to e.g. 'nan' which is then handled in the ParzenEstimator and _MixtureOfProductDistribution.
During sampling, we can just set a parameter in the sample to 'nan' if the active index had 'nan' values as well. In the likelihood estimate for categorical parameters, we can just introduce 'nan' as an additional category. For integers and floats, we can multiply the univariate distribution with a catagorical one, estimating the likelihood of the parameter being sampled or not.
Of course due to priors, this can result in some required parameters not being sampled, but these can be sampled with an independent tpe sampler.

This doesn't require huge changes in the code, and shouldn't change anything on the user end. If it performs well across tests, it should be able to replace the default {multivariate=True, group=False} behavior, as the behavior is only different from the current one when the intersection search space is different from the union search space.

Alternatives (optional)

Alternatively we can continue using the group decomposition, but instead determine the parameter hierarchy in the objective. I.e. in the above example, (x, t) is always sampled, so we can draw n_ei_candidates samples for them first using all trials. Then based on their values, we can either sample y or z from the subset of trials that contain these values. This requires some classifier which can either be given as user input, or be determined automatically by a simple decision tree. Then once all the required parameters have been sampled, we can choose the best sample from their collective likelihoods.
This only adds in correlation between the parameters that are part of the objective decision tree, but does help the sampler avoid getting stuck in local minima. However, the changes doesn't mesh well with the TPE sampler code, and would likely have to be implemented as a separate sampler.

Additional context (optional)

I made implementations of both approaches here: https://github.com/larsbratholm/optuna_tpe_suggestion/tree/main
as well as a benchmark on a slightly more complex objective shown here: https://github.com/larsbratholm/optuna_tpe_suggestion/blob/main/run_benchmark.py#L19

The geometric mean loss over 128 runs for RandomSampler, TPE, multivariate TPE, the non-group multivariate approach suggested, as well as the alternative hierarchical tpe sampler are shown below
benchmark

Both methods perform well on the objective, but the objective features high correlation and a local minima that is easy to get stuck in, so might not be representative overall?

Are either of these approaches something that would be welcome as a contribution to Optuna? And are there other objectives that have been used to benchmark the group decomposed TPESampler that would be helpful to test on?

@larsbratholm larsbratholm added the feature Change that does not break compatibility, but affects the public interfaces. label Mar 7, 2024
@nabenabe0928 nabenabe0928 self-assigned this May 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Change that does not break compatibility, but affects the public interfaces.
Projects
None yet
Development

No branches or pull requests

2 participants