
Question about random forest surrogate and high dimensional data #64

Open
wushanyun64 opened this issue Apr 25, 2023 · 3 comments

@wushanyun64

Hi Openbox,

I have a few quick questions, and I would really appreciate it if you could help me understand them better.

  1. For the random forest surrogate model, what method does OpenBox use to obtain the variance of a prediction? From the source code I have seen, it appears to bag random forest models with different random states and take the variance across the bag. Is this a statistically valid method, and if so, are there any references discussing this approach?
  2. I have seen some other packages use a jackknife to estimate the variance of an RF. Are you considering implementing this?
  3. Can OpenBox handle high-dimensional search spaces?

Thanks in advance,

@jhj0411jhj
Member

For your questions,

  1. The probabilistic random forest was proposed in SMAC [1] (source code: https://github.com/automl/random_forest_run). The PRF model computes the predicted mean/variance across its inner regression trees, not the mean/variance of a bag of RFs. (The source code you saw may be a deprecated version; the current one is here: https://github.com/PKU-DAIR/open-box/blob/master/openbox/surrogate/base/rf_with_instances.py)
    [1] Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. "Sequential model-based optimization for general algorithm configuration." Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, January 17-21, 2011. Selected Papers 5. Springer Berlin Heidelberg, 2011.

  2. Do you mean sampling multiple sets of training data and training different RFs? Could you provide the package names or papers? We will look into it, but we may not implement it in OpenBox since the PRF already works well.

  3. Vanilla BO suffers from the curse of dimensionality. Several newer methods (e.g., [2][3]) have been proposed to improve BO on high-dimensional problems, but algorithms for high-dimensional optimization are still under development in OpenBox.
    [2] Eriksson, David, et al. "Scalable global optimization via local Bayesian optimization." Advances in Neural Information Processing Systems 32 (2019).
    [3] Wang, Linnan, Rodrigo Fonseca, and Yuandong Tian. "Learning search space partition for black-box optimization using monte carlo tree search." Advances in Neural Information Processing Systems 33 (2020): 19511-19522.
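
To illustrate point 1, here is a minimal sketch of how a probabilistic random forest can turn per-tree predictions into a predictive mean and variance. It uses scikit-learn's `RandomForestRegressor` as a stand-in, not OpenBox's actual PRF (which lives in `rf_with_instances.py` above); the data and function names here are illustrative.

```python
# Sketch: predictive mean/variance from the inner regression trees of a
# random forest, in the spirit of SMAC's PRF. scikit-learn stands in for
# the SMAC/OpenBox implementation; all names here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def predict_mean_var(forest, X_query):
    """Mean and variance of the per-tree predictions at each query point."""
    per_tree = np.stack([tree.predict(X_query) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.var(axis=0)

mu, var = predict_mean_var(forest, X[:5])
```

The key point is that the uncertainty comes from disagreement among the trees of a single forest, not from training multiple forests with different random states.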

@wushanyun64
Author

Thank you for the explanation. Here is the paper I was referring to for the second point:
https://jmlr.org/papers/volume15/wager14a/wager14a.pdf
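
For reference, the linked paper's infinitesimal-jackknife (IJ) variance estimate for bagged predictors can be sketched from scratch as follows. The bootstrap counts are tracked explicitly; the variable names are illustrative and this is not any library's API.

```python
# Sketch: infinitesimal-jackknife variance estimate for a bagged predictor,
# following Wager, Hastie & Efron (JMLR 2014). Names are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 100, 200
X = rng.uniform(-2, 2, size=(n, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)
x_query = np.array([[0.5, 0.0]])

preds = np.empty(B)        # t_b(x): each bagged tree's prediction at x_query
counts = np.empty((B, n))  # N[b, i]: times sample i appears in bootstrap b
for b in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap resample
    counts[b] = np.bincount(idx, minlength=n)
    tree = DecisionTreeRegressor(random_state=b).fit(X[idx], y[idx])
    preds[b] = tree.predict(x_query)[0]

# V_IJ = sum_i Cov_b(N_bi, t_b(x))^2, plus the paper's bias correction.
centered = preds - preds.mean()
cov = (counts - counts.mean(axis=0)).T @ centered / B   # shape (n,)
v_ij = np.sum(cov ** 2)
v_ij_corrected = v_ij - n * centered.var() / B
```

The estimate measures how strongly each training point's bootstrap inclusion count covaries with the bagged prediction; the correction term removes the Monte Carlo bias from using a finite number of bootstrap replicates B.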

Regarding the third question, my understanding is that using an RF instead of a GP partially addresses the high-dimension problem, since GPs are typically not well suited to high-dimensional data. If we choose an RF as the surrogate, what other problems will we encounter when dealing with high-dimensional data?

@jhj0411jhj
Member

Hi @wushanyun64, RF handles high-dimensional problems better than GP. However, since data points are scarce relative to the dimensionality of the space, traditional BO methods tend to over-explore the space and may behave like random search. Some high-dimensional methods tackle this problem by restricting the search to a local region around existing points.
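
That "restrict the search to a local region" idea can be sketched in a few lines, in the spirit of the trust-region approach of TuRBO [2]. This is a toy illustration, not OpenBox's implementation: candidates are sampled only inside a box around the incumbent, which expands on improvement and shrinks otherwise; all names and thresholds are illustrative.

```python
# Sketch: trust-region-style local search in a high-dimensional box [0, 1]^d,
# loosely following the idea in TuRBO [2]. Toy objective; illustrative names.
import numpy as np

def objective(x):
    return np.sum((x - 0.3) ** 2)   # toy quadratic, optimum at x = 0.3

rng = np.random.default_rng(0)
dim, length = 20, 0.8               # trust-region side length within [0, 1]^dim
best_x = rng.uniform(0, 1, size=dim)
best_y = objective(best_x)
y0 = best_y                         # starting objective value, for reference

for _ in range(200):
    lo = np.clip(best_x - length / 2, 0, 1)
    hi = np.clip(best_x + length / 2, 0, 1)
    cand = rng.uniform(lo, hi, size=(32, dim))   # sample inside the region only
    ys = np.array([objective(c) for c in cand])
    if ys.min() < best_y:                        # success: move center, expand
        best_x, best_y = cand[ys.argmin()], ys.min()
        length = min(length * 1.5, 1.0)
    else:                                        # failure: shrink the region
        length *= 0.7
```

Because candidates never leave the shrinking box around the incumbent, the method avoids the over-exploration that makes global BO degenerate toward random search in high dimensions, at the cost of a more local search.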
