sparse / slow #226

Open
jovo opened this issue Feb 21, 2024 · 10 comments


@jovo (Member) commented Feb 21, 2024

@adam2392 When @jdey4 runs MORF, it is very slow. When we build the projection matrix in oblique trees, is it stored in a sparse matrix format? If not, can we make it one? If the matrix is sparse but not stored in a sparse format, we could save a lot of RAM and time by switching to one.
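A quick back-of-envelope illustration of why this matters. The numbers below are hypothetical (not MORF's defaults), but they show how much cheaper storing only the nonzero (index, weight) pairs is compared to a dense array:

```python
# Hypothetical sizes, chosen only for illustration.
n_features = 100_000   # columns of X
max_features = 20      # candidate projections per split
nnz_per_proj = 9       # e.g. a 3x3 patch contributes 9 nonzero weights

# Dense float64 projection matrix: every entry stored.
dense_bytes = max_features * n_features * 8

# Sparse storage: one float32 weight + one 8-byte index per nonzero entry.
sparse_bytes = max_features * nnz_per_proj * (4 + 8)

print(dense_bytes, sparse_bytes)  # dense is thousands of times larger
```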

@adam2392 (Collaborator)

I asked @jdey4 if he could post a GH issue, so I'm unsure how he's running things. It is true that MORF is not very well tested and benchmarked currently.

cdef void sample_proj_mat(
    self,
    vector[vector[float32_t]]& proj_mat_weights,
    vector[vector[intp_t]]& proj_mat_indices
) noexcept nogil:
    """Sample projection matrix using a contiguous patch.

    Randomly sample patches with weight of 1.
    """
    cdef intp_t max_features = self.max_features
    cdef intp_t proj_i

    # top-left seed for vectorized points in the original data shape
    cdef intp_t top_left_patch_seed

    # size of the sampled patch, which is just the size of the n-dim patch
    # (\prod_i self.patch_dims_buff[i])
    cdef intp_t patch_size

    for proj_i in range(0, max_features):
        # compute the top-left seed that determines the top-left
        # position of the multi-dimensional patch
        top_left_patch_seed, patch_size = self.sample_top_left_seed()

        # sample a projection vector given the top-left seed point
        # in n-dimensional space
        self.sample_proj_vec(
            proj_mat_weights,
            proj_mat_indices,
            proj_i,
            patch_size,
            top_left_patch_seed,
            self.patch_dims_buff
        )
This shows that the projection matrix is already stored in a sparse format: each projection vector is represented by a vector of its nonzero feature indices and a matching vector of weights. Only non-zero weights are stored.
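As a rough Python analogue of the vector-of-vectors storage above (the function body here is a toy stand-in, not the actual `sample_proj_vec` logic):

```python
# proj_mat_indices[i] holds the nonzero feature indices of projection i,
# proj_mat_weights[i] the matching weights. Zero entries are never stored.
proj_mat_indices = [[] for _ in range(3)]  # 3 candidate projections
proj_mat_weights = [[] for _ in range(3)]

def sample_proj_vec(indices, weights, proj_i, patch_features):
    """Toy stand-in: give every feature in the patch a weight of 1."""
    for f in patch_features:
        indices[proj_i].append(f)
        weights[proj_i].append(1.0)

# e.g. three patches starting at different top-left positions
sample_proj_vec(proj_mat_indices, proj_mat_weights, 0, [0, 1, 2])
sample_proj_vec(proj_mat_indices, proj_mat_weights, 1, [10, 11])
sample_proj_vec(proj_mat_indices, proj_mat_weights, 2, [42])

print(proj_mat_indices)  # [[0, 1, 2], [10, 11], [42]]
```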

@adam2392 (Collaborator)

Possibly something for Edward's team et al. to consider? @jovo

It would be nice to have some measure of performance that we can run from n_samples 100 to >> 100.
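A minimal benchmarking harness along those lines might look like the sketch below. The harness itself is hypothetical; the `_DummyEstimator` is a stand-in so the snippet is self-contained, and in practice you would pass a factory for `ObliqueRandomForestClassifier` instead:

```python
import time
import numpy as np

def benchmark_fit(make_estimator, n_samples_grid, n_features=1000, seed=0):
    """Time estimator.fit() across a grid of sample sizes.

    `make_estimator` is any zero-arg factory returning an object with a
    scikit-learn-style fit(X, y) method.
    """
    rng = np.random.default_rng(seed)
    results = []
    for n in n_samples_grid:
        X = rng.standard_normal((n, n_features))
        y = rng.integers(0, 2, size=n)
        est = make_estimator()
        t0 = time.perf_counter()
        est.fit(X, y)
        results.append((n, time.perf_counter() - t0))
    return results

class _DummyEstimator:
    """Trivial stand-in estimator; swap in the real classifier to benchmark."""
    def fit(self, X, y):
        self.mean_ = X.mean(axis=0)
        return self

timings = benchmark_fit(_DummyEstimator, [100, 500, 2000])
```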

@jdey4 (Member) commented Feb 23, 2024

Hi @adam2392, I used the following code snippet to train SPORF:

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y)
clf_sporf = ObliqueRandomForestClassifier(n_estimators=100, max_features=20)
clf_sporf.fit(x_train, y_train)

X has a shape of (2368, 3498706), and the above code runs fine, taking about 50 minutes to train. But if I increase max_features to 100, it exhausts my RAM (64 GB, Apple M1 Max).
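A back-of-envelope check suggests why this could happen. If a dense float64 intermediate of shape `(max_features, n_features)` is materialized anywhere along the way (an assumption, not confirmed from the code), a single such matrix at these sizes is already multi-gigabyte:

```python
# Sizes taken from the comment above.
n_features = 3_498_706
max_features = 100

# One dense float64 projection matrix of shape (max_features, n_features).
dense_proj_bytes = max_features * n_features * 8
print(round(dense_proj_bytes / 1e9, 1))  # ~2.8 GB for a single dense matrix
```

At ~2.8 GB per matrix, even a handful of such allocations across nodes or trees would exceed 64 GB.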

@adam2392 (Collaborator)

Ah I see. That's interesting. I wouldn't expect that to happen. How many trees are you training simultaneously?

@jdey4 (Member) commented Feb 23, 2024

For now I am using 100 trees, but I would love to use 1000 trees.

@adam2392 (Collaborator)

Sorry, I am asking how many jobs you are training in parallel. I.e., if you're training 100 trees in parallel, I am less surprised that you're running out of RAM.

@jdey4 (Member) commented Feb 23, 2024

Ah, I was using the default parameters, for which n_jobs=None.

@adam2392 (Collaborator)

Ah I see... that is then training 1 tree at a time. Can you tell me:

  1. How deep is one tree?
  2. If you do clf.estimators_[0].tree_.get_projection_matrix(), what does an example projection matrix look like (maybe as a heat map?), and what is its shape?

@jdey4 (Member) commented Mar 7, 2024

I tried MORF on brain MRI data with X.shape=(2206, 3498706). The server I used has 754 GB of memory. I used only 1 worker and the code still broke. When I fit MORF with 100 features, it works.

@jdey4 (Member) commented Mar 7, 2024

@adam2392 here is my code snippet. It works for max_patch_dims=(3,3,3).
[Screenshot of the code snippet, 2024-03-07]
