Model crashes for very small data #103

Closed
Alaya-in-Matrix opened this issue Jan 7, 2022 · 5 comments

Labels
feature request New feature or request

Comments

Alaya-in-Matrix commented Jan 7, 2022
While experimenting with the DF model on a toy dataset, I found that model fitting crashes when the training set is very small. The code below reproduces the bug:

import numpy as np
from deepforest import CascadeForestRegressor

np.random.seed(0)
# xs = np.random.randn(10, 1)  # this size works
xs = np.random.randn(5, 1)     # this size crashes

ys = np.sinc(xs).reshape(-1)

model = CascadeForestRegressor(verbose=1, random_state=0)
model.fit(xs, ys)

Below is the error message:

[2022-01-07 03:05:04.750] Start to fit the model:
[2022-01-07 03:05:04.751] Fitting cascade layer = 0 
[2022-01-07 03:05:05.213] layer = 0  | Val MSE = 0.16421 | Elapsed = 0.463 s
[2022-01-07 03:05:05.214] Fitting cascade layer = 1 
[2022-01-07 03:05:05.662] layer = 1  | Val MSE = 0.18319 | Elapsed = 0.448 s
[2022-01-07 03:05:05.663] Early stopping counter: 1 out of 2
[2022-01-07 03:05:05.663] Fitting cascade layer = 2 
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
C:\Users\L00517~1\AppData\Local\Temp/ipykernel_20764/2130250665.py in <module>
      1 model = CascadeForestRegressor(verbose = 1, random_state = 0)
----> 2 model.fit(xs,ys)

D:\Anaconda\lib\site-packages\deepforest\cascade.py in fit(self, X, y, sample_weight)
   1594         self._check_target_values(y)
   1595 
-> 1596         super().fit(X, y, sample_weight)
   1597 
   1598     def predict(self, X):

D:\Anaconda\lib\site-packages\deepforest\cascade.py in fit(self, X, y, sample_weight)
    866 
    867             tic = time.time()
--> 868             X_aug_train_ = layer_.fit_transform(
    869                 X_middle_train_, y, sample_weight=sample_weight
    870             )

D:\Anaconda\lib\site-packages\deepforest\_layer.py in fit_transform(self, X, y, sample_weight)
    295         # A random forest and an extremely random forest will be fitted
    296         for estimator_idx in range(self.n_estimators // 2):
--> 297             X_aug_, _estimator = _build_estimator(
    298                 X,
    299                 y,

D:\Anaconda\lib\site-packages\deepforest\_layer.py in _build_estimator(X, y, layer_idx, estimator_idx, estimator_name, estimator, oob_decision_function, partial_mode, buffer, verbose, sample_weight)
     38         print(msg.format(_utils.ctime(), key, layer_idx))
     39 
---> 40     X_aug_train = estimator.fit_transform(X, y, sample_weight)
     41     oob_decision_function += estimator.oob_decision_function_
     42 

D:\Anaconda\lib\site-packages\deepforest\_estimator.py in fit_transform(self, X, y, sample_weight)
    197 
    198     def fit_transform(self, X, y, sample_weight=None):
--> 199         self.estimator_.fit(X, y, sample_weight)
    200         return self.oob_decision_function_
    201 

D:\Anaconda\lib\site-packages\deepforest\forest.py in fit(self, X, y, sample_weight)
    461 
    462         lock = threading.Lock()
--> 463         rets = Parallel(
    464             n_jobs=n_jobs,
    465             verbose=self.verbose,

D:\Anaconda\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1044                 self._iterating = self._original_iterator is not None
   1045 
-> 1046             while self.dispatch_one_batch(iterator):
   1047                 pass
   1048 

D:\Anaconda\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    859                 return False
    860             else:
--> 861                 self._dispatch(tasks)
    862                 return True
    863 

D:\Anaconda\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    777         with self._lock:
    778             job_idx = len(self._jobs)
--> 779             job = self._backend.apply_async(batch, callback=cb)
    780             # A job can complete so quickly than its callback is
    781             # called before we get here, causing self._jobs to

D:\Anaconda\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

D:\Anaconda\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

D:\Anaconda\lib\site-packages\joblib\parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

D:\Anaconda\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

D:\Anaconda\lib\site-packages\deepforest\forest.py in _parallel_build_trees(tree, X, y, n_samples_bootstrap, sample_weight, out, mask, is_classifier, lock)
    123     if sample_weight is not None:
    124         sample_weight = sample_weight[sample_mask]
--> 125     feature, threshold, children, value = tree.fit(
    126         X[sample_mask],
    127         y[sample_mask],

D:\Anaconda\lib\site-packages\deepforest\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    534     ):
    535 
--> 536         return super().fit(
    537             X,
    538             y,

D:\Anaconda\lib\site-packages\deepforest\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    374         )
    375 
--> 376         builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
    377 
    378         if self.n_outputs_ == 1 and is_classifier(self):

deepforest\tree\_tree.pyx in deepforest.tree._tree.DepthFirstTreeBuilder.build()

deepforest\tree\_tree.pyx in deepforest.tree._tree.DepthFirstTreeBuilder.build()

deepforest\tree\_tree.pyx in deepforest.tree._tree.Tree._resize_node_c()

deepforest\tree\_utils.pyx in deepforest.tree._utils.safe_realloc()

MemoryError: could not allocate 0 bytes
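As a stopgap until small inputs are handled upstream, one can guard the call to `fit` and fail with a clear error instead of a deep `MemoryError`. This is a minimal sketch: `MIN_SAMPLES` and `fit_with_size_check` are hypothetical names, and the threshold of 10 is an assumption based on the observation in this thread that 10 samples work while 5 do not, not a documented deep-forest constant.

```python
# Hypothetical workaround sketch (not part of deep-forest): refuse to fit
# when the training set is too small. MIN_SAMPLES is an assumed threshold,
# not a documented deep-forest constant.
MIN_SAMPLES = 10

def fit_with_size_check(model, X, y, min_samples=MIN_SAMPLES):
    """Raise a clear error instead of letting tiny inputs crash inside fit."""
    if len(X) < min_samples:
        raise ValueError(
            f"training set has {len(X)} samples; need at least {min_samples}"
        )
    model.fit(X, y)
    return model
```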
xuyxu (Member) commented Jan 7, 2022

Hi @Alaya-in-Matrix, which version of numpy do you have installed?

@xuyxu xuyxu added the needtriage Further information is requested label Jan 7, 2022
Alaya-in-Matrix (Author) commented

@xuyxu It's numpy 1.19.5:

deep-forest==0.1.5
joblib==1.1.0
numpy==1.19.5
scikit-learn==1.0.2
scipy==1.7.3
threadpoolctl==3.0.0

xuyxu (Member) commented Jan 16, 2022

The regression demo works fine with your package environment:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from deepforest import CascadeForestRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("\nTesting MSE: {:.3f}".format(mse))

In addition, CascadeForestRegressor also works fine with xs = np.random.randn(10, 1) in your code snippet. The exception is most likely caused by the toy dataset having too few samples.
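This diagnosis is consistent with how bootstrap resampling behaves at this scale (a general observation about bagging, not a claim about deep-forest internals): when n points are drawn with replacement from n, the expected number of unique points is n·(1 − (1 − 1/n)^n), so with only 5 training points each tree sees on average only about 3.4 unique points, leaving almost nothing to split on.

```python
import numpy as np

# Simulate bootstrap resampling of a 5-point dataset and count how many
# unique points each resample contains. The expected value is
# n * (1 - (1 - 1/n)**n), which for n = 5 is about 3.36.
rng = np.random.default_rng(0)
n = 5
trials = 10_000
unique_counts = [
    np.unique(rng.integers(0, n, size=n)).size for _ in range(trials)
]
print(np.mean(unique_counts))  # close to 5 * (1 - 0.8**5) ≈ 3.36
```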

@xuyxu xuyxu added wontfix This will not be worked on and removed needtriage Further information is requested labels Jan 16, 2022
Alaya-in-Matrix (Author) commented

@xuyxu That's exactly what I'm reporting: why would DF work for xs = np.random.randn(10, 1) but not for randn(5, 1)? That makes no sense.

@xuyxu xuyxu mentioned this issue Jan 17, 2022
13 tasks
@xuyxu xuyxu added feature request New feature or request and removed wontfix This will not be worked on labels Jan 17, 2022
xuyxu (Member) commented Jan 21, 2022

Closed via #14

@xuyxu xuyxu closed this as completed Jan 21, 2022