sparse / slow #226

Open
jovo opened this issue Feb 21, 2024 · 10 comments


@jovo (Member) commented Feb 21, 2024

@adam2392 When @jdey4 runs MORF, it is very slow. When we build the projection matrix in oblique trees, is it stored in a sparse matrix format? If not, can we make it one? If the matrix is sparse but not stored in a sparse format, we could save a lot of RAM and time by switching to one.
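A quick back-of-envelope illustration of why this matters. The numbers below are hypothetical (not MORF's defaults), but they show how much cheaper storing only the nonzero (index, weight) pairs is compared to a dense array:

```python
# Hypothetical sizes, chosen only for illustration.
n_features = 100_000   # columns of X
max_features = 20      # candidate projections per split
nnz_per_proj = 9       # e.g. a 3x3 patch contributes 9 nonzero weights

# Dense float64 projection matrix: every entry stored.
dense_bytes = max_features * n_features * 8

# Sparse storage: one float32 weight + one 8-byte index per nonzero entry.
sparse_bytes = max_features * nnz_per_proj * (4 + 8)

print(dense_bytes, sparse_bytes)  # dense is thousands of times larger
```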

@adam2392 (Collaborator)

I asked @jdey4 if he could post a GH issue, so I'm unsure how he's running things. It is true that MORF is not very well tested and benchmarked currently.

cdef void sample_proj_mat(
    self,
    vector[vector[float32_t]]& proj_mat_weights,
    vector[vector[intp_t]]& proj_mat_indices
) noexcept nogil:
    """Sample projection matrix using a contiguous patch.

    Randomly sample patches with weight of 1.
    """
    cdef intp_t max_features = self.max_features
    cdef intp_t proj_i

    # top-left seed for vectorized points in the original data shape
    cdef intp_t top_left_patch_seed

    # size of the sampled patch, which is just the size of the n-dim patch
    # (\prod_i self.patch_dims_buff[i])
    cdef intp_t patch_size

    for proj_i in range(0, max_features):
        # compute the top-left seed that determines the top-left
        # position of the multi-dimensional patch
        top_left_patch_seed, patch_size = self.sample_top_left_seed()

        # sample a projection vector given the top-left seed point
        # in n-dimensional space
        self.sample_proj_vec(
            proj_mat_weights,
            proj_mat_indices,
            proj_i,
            patch_size,
            top_left_patch_seed,
            self.patch_dims_buff
        )
This shows that the projection matrix is already stored in a sparse format: each projection vector is represented by a vector of its nonzero feature indices and a matching vector of weights. Only non-zero weights are stored.
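As a rough Python analogue of the vector-of-vectors storage above (the function body here is a toy stand-in, not the actual `sample_proj_vec` logic):

```python
# proj_mat_indices[i] holds the nonzero feature indices of projection i,
# proj_mat_weights[i] the matching weights. Zero entries are never stored.
proj_mat_indices = [[] for _ in range(3)]  # 3 candidate projections
proj_mat_weights = [[] for _ in range(3)]

def sample_proj_vec(indices, weights, proj_i, patch_features):
    """Toy stand-in: give every feature in the patch a weight of 1."""
    for f in patch_features:
        indices[proj_i].append(f)
        weights[proj_i].append(1.0)

# e.g. three patches starting at different top-left positions
sample_proj_vec(proj_mat_indices, proj_mat_weights, 0, [0, 1, 2])
sample_proj_vec(proj_mat_indices, proj_mat_weights, 1, [10, 11])
sample_proj_vec(proj_mat_indices, proj_mat_weights, 2, [42])

print(proj_mat_indices)  # [[0, 1, 2], [10, 11], [42]]
```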

@adam2392 (Collaborator)

Possibly something for Edward's team et al. to consider? @jovo

It would be nice to have some measure of performance that we can run from n_samples 100 to >> 100.
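A minimal benchmarking harness along those lines might look like the sketch below. The harness itself is hypothetical; the `_DummyEstimator` is a stand-in so the snippet is self-contained, and in practice you would pass a factory for `ObliqueRandomForestClassifier` instead:

```python
import time
import numpy as np

def benchmark_fit(make_estimator, n_samples_grid, n_features=1000, seed=0):
    """Time estimator.fit() across a grid of sample sizes.

    `make_estimator` is any zero-arg factory returning an object with a
    scikit-learn-style fit(X, y) method.
    """
    rng = np.random.default_rng(seed)
    results = []
    for n in n_samples_grid:
        X = rng.standard_normal((n, n_features))
        y = rng.integers(0, 2, size=n)
        est = make_estimator()
        t0 = time.perf_counter()
        est.fit(X, y)
        results.append((n, time.perf_counter() - t0))
    return results

class _DummyEstimator:
    """Trivial stand-in estimator; swap in the real classifier to benchmark."""
    def fit(self, X, y):
        self.mean_ = X.mean(axis=0)
        return self

timings = benchmark_fit(_DummyEstimator, [100, 500, 2000])
```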

@jdey4 (Member) commented Feb 23, 2024

Hi @adam2392, I used the following code snippet to train SPORF:

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y)
clf_sporf = ObliqueRandomForestClassifier(n_estimators=100, max_features=20)
clf_sporf.fit(x_train, y_train)

X has a shape of (2368, 3498706), and the above code runs fine, taking about 50 minutes to train. But if I increase max_features to 100, it exhausts my RAM (64 GB, Apple M1 Max).
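A back-of-envelope check suggests why this could happen. If a dense float64 intermediate of shape `(max_features, n_features)` is materialized anywhere along the way (an assumption, not confirmed from the code), a single such matrix at these sizes is already multi-gigabyte:

```python
# Sizes taken from the comment above.
n_features = 3_498_706
max_features = 100

# One dense float64 projection matrix of shape (max_features, n_features).
dense_proj_bytes = max_features * n_features * 8
print(round(dense_proj_bytes / 1e9, 1))  # ~2.8 GB for a single dense matrix
```

At ~2.8 GB per matrix, even a handful of such allocations across nodes or trees would exceed 64 GB.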

@adam2392 (Collaborator)

Ah I see. That's interesting. I wouldn't expect that to happen. How many trees are you training simultaneously?

@jdey4 (Member) commented Feb 23, 2024

For now I am using 100 trees, but I would love to use 1000 trees.

@adam2392 (Collaborator)

Sorry, I am asking how many jobs you are training in parallel. I.e., if you're training 100 trees in parallel, I am less surprised that you're running out of RAM.

@jdey4 (Member) commented Feb 23, 2024

Ah, I was using the default parameters, for which n_jobs=None.

@adam2392 (Collaborator)

Ah I see... that is then training 1 tree at a time. Can you tell me:

  1. How deep is one tree?
  2. If you do clf.estimators_[0].tree_.get_projection_matrix(), what does an example projection matrix look like (maybe as a heat map?), and what is its shape?

@jdey4 (Member) commented Mar 7, 2024

I tried MORF on brain MRI data with X.shape=(2206, 3498706). The server I used has 754 GB of memory. I used only 1 worker and the code still broke. When I fit MORF with 100 features, it works.

@jdey4 (Member) commented Mar 7, 2024

@adam2392 here is my code snippet. It works for max_patch_dims=(3,3,3).
[Screenshot of the code snippet, 2024-03-07]
