Creating a fixed "large" projection matrix per tree instead of at each node #237

Open · adam2392 opened this issue Mar 1, 2024 · 0 comments
Labels: research (Requires experimentation, theory and research.)

adam2392 (Collaborator) commented Mar 1, 2024

Currently, at each node a completely new projection array of shape (max_features, n_features, n_dims_projection) is sampled, where X is (n_samples, n_features). Each (n_features, n_dims_projection) slice is a new projection matrix. In practice, we do this in a "sparse" way, storing only the feature index and the projection weight to apply.
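For concreteness, a minimal numpy sketch of the current per-node scheme, simplified to n_dims_projection = 1 (the function name, sparsity, and n_nonzero are illustrative assumptions, not the actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_node_projections(max_features, n_features, n_nonzero=3):
    """Sample a fresh batch of sparse projection vectors for ONE split node.

    Each candidate is stored as parallel (feature-index, weight) arrays,
    mirroring the sparse storage described above.
    """
    projections = []
    for _ in range(max_features):
        feat_idx = rng.integers(0, n_features, size=n_nonzero)
        weights = rng.choice([-1.0, 1.0], size=n_nonzero)
        projections.append((feat_idx, weights))
    return projections

# This runs at EVERY split node, so deep trees re-sample constantly.
node_projections = sample_node_projections(max_features=10, n_features=100)
```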

This ends up consuming a lot of RAM (and possibly resulting in segfaults, though I'm unsure why). See related: #226 and #215.

Another sensible strategy is to sample one LARGE projection array per tree, of shape (LARGE_MAX_FEATURES, n_features, n_dims_projection), and then consider a random subset of max_features projection matrices from this pool at each split node. This amortizes the sampling of the projection matrix, doing it only once per tree. Though depending on how large LARGE_MAX_FEATURES is, we would have to hold a huge array in RAM.
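A minimal sketch of the proposed per-tree pool, using dense storage for clarity (the pool would stay sparse in practice, and the LARGE_MAX_FEATURES value here is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

LARGE_MAX_FEATURES = 1_000          # hypothetical; the open question below
max_features = 10
n_features, n_dims_projection = 100, 1

# Sampled ONCE per tree instead of at every node.
tree_projections = rng.choice(
    [-1.0, 0.0, 1.0], p=[0.05, 0.9, 0.05],
    size=(LARGE_MAX_FEATURES, n_features, n_dims_projection),
)

def node_candidate_indices():
    """At each split node, pick a random subset of the per-tree pool."""
    return rng.choice(LARGE_MAX_FEATURES, size=max_features, replace=False)

idx = node_candidate_indices()
candidates = tree_projections[idx]  # (max_features, n_features, n_dims_projection)
```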

There are some perceivable benefits, though:

  1. We can track constants in the projection: we only need to keep a vector of length LARGE_MAX_FEATURES, and whenever splitting on a projection vector results in no change in impurity at some point in the tree, we can mark it and skip that projection vector from then on (see the sketch after this list).
  2. For deep trees, this can result in a considerable runtime improvement.
  3. Assuming we can eat the up-front RAM cost of sampling the large projection matrix, we won't have large RAM spikes from sampling a bunch of new projection arrays at every split node.
  4. This could allow the user to specify the large projection array in Python and pass it in! That would make it easy to test the Gabor and Fourier kernel ideas, because specifying a projection matrix for these complex projections is a lot easier in Python.
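For point 1, a boolean mask over the pool would suffice; a rough sketch (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
LARGE_MAX_FEATURES = 1_000  # same hypothetical pool size as above

# True once a pooled projection vector is known to yield no impurity change;
# later nodes can then skip it outright.
is_constant = np.zeros(LARGE_MAX_FEATURES, dtype=bool)

def usable_candidates(max_features=10):
    """Draw split candidates from the pool, skipping known-constant ones."""
    live = np.flatnonzero(~is_constant)
    k = min(max_features, live.size)
    return rng.choice(live, size=k, replace=False)

# e.g. after finding that pooled projection j gives zero impurity improvement:
# is_constant[j] = True
```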

The open question, though, is: what should LARGE_MAX_FEATURES be? max_features * 100, max_features * 10, max_features * <some hyperparameter>?
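To put rough numbers on the trade-off, a back-of-the-envelope RAM estimate under sparse storage, where each nonzero costs one int32 index plus one float32 weight (the sparsity and dataset size are illustrative assumptions):

```python
n_features = 5_000
max_features = int(n_features ** 0.5)   # ~70, a common default
nnz_per_projection = 10                 # illustrative sparsity

for multiplier in (10, 100, 1000):
    large_max_features = max_features * multiplier
    bytes_per_tree = large_max_features * nnz_per_projection * 8
    print(f"multiplier={multiplier:>4}: ~{bytes_per_tree / 1e6:.2f} MB per tree")
```

Under these assumptions even the x1000 multiplier stays in the single-megabyte range per tree, though a forest of hundreds of trees multiplies that accordingly.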

cc: @jovo @j1c

@adam2392 adam2392 added the research Requires experimentation, theory and research. label Mar 1, 2024