Creating a fixed "large" projection matrix per tree instead of at each node #237

Open · adam2392 opened this issue Mar 1, 2024 · 0 comments
Labels: research (Requires experimentation, theory and research.)

adam2392 (Collaborator) commented Mar 1, 2024

Currently, at each node a completely new projection array of shape (max_features, n_features, n_dims_projection) is sampled, where X is (n_samples, n_features). Each (n_features, n_dims_projection) slice is a new projection matrix. In practice, we do this in a "sparse" way, storing only the feature index and the projection weight to apply.
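For concreteness, a minimal numpy sketch of the current per-node scheme, simplified to n_dims_projection = 1 (the function name, sparsity, and n_nonzero are illustrative assumptions, not the actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_node_projections(max_features, n_features, n_nonzero=3):
    """Sample a fresh batch of sparse projection vectors for ONE split node.

    Each candidate is stored as parallel (feature-index, weight) arrays,
    mirroring the sparse storage described above.
    """
    projections = []
    for _ in range(max_features):
        feat_idx = rng.integers(0, n_features, size=n_nonzero)
        weights = rng.choice([-1.0, 1.0], size=n_nonzero)
        projections.append((feat_idx, weights))
    return projections

# This runs at EVERY split node, so deep trees re-sample constantly.
node_projections = sample_node_projections(max_features=10, n_features=100)
```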

This ends up consuming a lot of RAM (and possibly resulting in segfaults, though I'm unsure why). See related: #226 and #215.

Another sensible strategy is to sample one LARGE projection array per tree, of shape (LARGE_MAX_FEATURES, n_features, n_dims_projection), and then consider a random subset of max_features projection matrices from this pool at each split node. This amortizes the sampling of the projection matrix, doing it only once per tree. Though depending on how large LARGE_MAX_FEATURES is, we would have to hold a huge array in RAM.
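A minimal sketch of the proposed per-tree pool, using dense storage for clarity (the pool would stay sparse in practice, and the LARGE_MAX_FEATURES value here is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

LARGE_MAX_FEATURES = 1_000          # hypothetical; the open question below
max_features = 10
n_features, n_dims_projection = 100, 1

# Sampled ONCE per tree instead of at every node.
tree_projections = rng.choice(
    [-1.0, 0.0, 1.0], p=[0.05, 0.9, 0.05],
    size=(LARGE_MAX_FEATURES, n_features, n_dims_projection),
)

def node_candidate_indices():
    """At each split node, pick a random subset of the per-tree pool."""
    return rng.choice(LARGE_MAX_FEATURES, size=max_features, replace=False)

idx = node_candidate_indices()
candidates = tree_projections[idx]  # (max_features, n_features, n_dims_projection)
```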

There are some perceivable benefits, though:

  1. We can track constants in the projection: we only need to keep a vector of length LARGE_MAX_FEATURES, and whenever splitting on a projection vector results in no change in impurity at some point in the tree, we can mark it and skip that projection vector from then on (see the sketch after this list).
  2. For deep trees, this can result in a considerable runtime improvement.
  3. Assuming we can eat the up-front RAM cost of sampling the large projection matrix, we won't have large RAM spikes from sampling a bunch of new projection arrays at every split node.
  4. This could allow the user to specify the large projection array in Python and pass it in! That would make it easy to test the Gabor and Fourier kernel ideas, because specifying a projection matrix for these complex projections is a lot easier in Python.
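For point 1, a boolean mask over the pool would suffice; a rough sketch (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
LARGE_MAX_FEATURES = 1_000  # same hypothetical pool size as above

# True once a pooled projection vector is known to yield no impurity change;
# later nodes can then skip it outright.
is_constant = np.zeros(LARGE_MAX_FEATURES, dtype=bool)

def usable_candidates(max_features=10):
    """Draw split candidates from the pool, skipping known-constant ones."""
    live = np.flatnonzero(~is_constant)
    k = min(max_features, live.size)
    return rng.choice(live, size=k, replace=False)

# e.g. after finding that pooled projection j gives zero impurity improvement:
# is_constant[j] = True
```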

The open question, though, is: what should LARGE_MAX_FEATURES be? max_features * 100, max_features * 10, max_features * <some hyperparameter>?
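To put rough numbers on the trade-off, a back-of-the-envelope RAM estimate under sparse storage, where each nonzero costs one int32 index plus one float32 weight (the sparsity and dataset size are illustrative assumptions):

```python
n_features = 5_000
max_features = int(n_features ** 0.5)   # ~70, a common default
nnz_per_projection = 10                 # illustrative sparsity

for multiplier in (10, 100, 1000):
    large_max_features = max_features * multiplier
    bytes_per_tree = large_max_features * nnz_per_projection * 8
    print(f"multiplier={multiplier:>4}: ~{bytes_per_tree / 1e6:.2f} MB per tree")
```

Under these assumptions even the x1000 multiplier stays in the single-megabyte range per tree, though a forest of hundreds of trees multiplies that accordingly.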

cc: @jovo @j1c

@adam2392 adam2392 added the research Requires experimentation, theory and research. label Mar 1, 2024