
take() got an unexpected keyword argument 'axis' #84

Open
JiaLeXian opened this issue May 29, 2021 · 5 comments
Labels
enhancement Miscellaneous improvements

Comments

@JiaLeXian

Got an error with this code:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from deepforest import CascadeForestClassifier

model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)


TypeError Traceback (most recent call last)
in
6
7 model = CascadeForestClassifier(random_state=1)
----> 8 model.fit(X_train, y_train.values.ravel())

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in fit(self, X, y, sample_weight)
1395 y = self._encode_class_labels(y)
1396
-> 1397 super().fit(X, y, sample_weight)
1398
1399 def predict_proba(self, X):

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in fit(self, X, y, sample_weight)
754
755 # Bin the training data
--> 756 X_train_ = self._bin_data(binner, X, is_training_data=True)
757 X_train_ = self.buffer_.cache_data(0, X_train_, is_training_data=True)
758

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/cascade.py in _bin_data(self, binner, X, is_training_data)
665 tic = time.time()
666 if is_training_data:
--> 667 X_binned = binner.fit_transform(X)
668 else:
669 X_binned = binner.transform(X)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
697 if y is None:
698 # fit method of arity 1 (unsupervised transformation)
--> 699 return self.fit(X, **fit_params).transform(X)
700 else:
701 # fit method of arity 2 (supervised transformation)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/_binner.py in fit(self, X)
128 self.validate_params()
129
--> 130 self.bin_thresholds_ = _find_binning_thresholds(
131 X,
132 self.n_bins - 1,

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/deepforest/_binner.py in _find_binning_thresholds(X, n_bins, bin_subsample, bin_type, random_state)
75 if n_samples > bin_subsample:
76 subset = rng.choice(np.arange(n_samples), bin_subsample, replace=False)
---> 77 X = X.take(subset, axis=0)
78
79 binning_thresholds = []

TypeError: take() got an unexpected keyword argument 'axis'

The dataset is loaded with vaex; is this problem particular to vaex?
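For context, a minimal sketch (not deepforest's actual code) of why the call fails: the binner passes `axis=0` to the input's `take` method, which numpy arrays accept but an object exposing a `take` without that keyword does not. The `FrameWithoutAxis` class below is a made-up stand-in that reproduces the same TypeError:

```python
import numpy as np

# Hypothetical stand-in for a DataFrame-like object whose take()
# lacks the `axis` keyword, mimicking the failing input.
class FrameWithoutAxis:
    def __init__(self, data):
        self.data = np.asarray(data)

    def take(self, indices):  # note: no axis parameter
        return FrameWithoutAxis(self.data[indices])

X_np = np.arange(12).reshape(4, 3)
subset = np.array([0, 2])

# numpy arrays accept the keyword used inside _find_binning_thresholds:
print(X_np.take(subset, axis=0).shape)  # (2, 3)

# the stand-in raises the same error seen in the traceback:
try:
    FrameWithoutAxis(X_np).take(subset, axis=0)
except TypeError as e:
    print(e)  # take() got an unexpected keyword argument 'axis'
```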

@xuyxu
Member

xuyxu commented May 29, 2021

Hi @JiaLeXian, thanks for reporting! I will take a look at vaex when I get a moment. For now, you can manually convert your data into a numpy array in order to use deep forest.
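A minimal sketch of that workaround, using a toy pandas frame in place of the real dataset (the vaex loading step is assumed and not shown):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset.
df = pd.DataFrame({"f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]})

# Materialize as a numpy array, the input type deepforest handles.
X = df.to_numpy()
print(type(X).__name__, X.shape)  # ndarray (3, 2)

# For a vaex DataFrame the equivalent would be something like
# vaex_df.to_pandas_df().to_numpy() -- an untested sketch, and note
# it loads everything into memory, which defeats the point of vaex
# for very large data.
```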

@xuyxu xuyxu added the needtriage Further information is requested label May 29, 2021
@xuyxu
Member

xuyxu commented May 30, 2021

It looks like vaex does not support slicing (vaexio/vaex#911), which is an essential operation in deep forest, e.g., bootstrap sampling when building random forests. At least for now, this problem cannot be solved :-(

Thanks for reporting anyway.

@xuyxu xuyxu added wontfix This will not be worked on and removed needtriage Further information is requested labels May 30, 2021
@JiaLeXian
Author

Hi @xuyxu, thanks for investigating the problem. Appreciated! So, for DF, is it best to use a numpy array or a plain pandas dataframe?

In our case, we have more than 100 million rows of data. That's why we use vaex to load the data to reduce memory occupation. We still want to try DF on our dataset. We will explore other ways to try. Thank you!

@xuyxu
Member

xuyxu commented Jun 2, 2021

Could you take a look at numpy.memmap? It looks like there is no need to load the entire dataset into memory with memmap either.
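A minimal sketch of that idea; the path, dtype, and shapes are made up, and whether deepforest works end-to-end with a memmap of this size is an assumption here, not something verified:

```python
import os
import tempfile
import numpy as np

# Create a small on-disk array standing in for the real dataset.
path = os.path.join(tempfile.mkdtemp(), "X.dat")
n_samples, n_features = 1000, 8

# Write the data to disk once...
X_disk = np.memmap(path, dtype="float32", mode="w+",
                   shape=(n_samples, n_features))
X_disk[:] = np.random.RandomState(0).rand(n_samples, n_features)
X_disk.flush()

# ...then reopen read-only; pages are loaded lazily as they are
# touched, so the full array never has to fit in memory at once.
X = np.memmap(path, dtype="float32", mode="r",
              shape=(n_samples, n_features))
print(X.shape, isinstance(X, np.ndarray))  # (1000, 8) True

# Hypothetical usage, since np.memmap subclasses ndarray:
# model = CascadeForestClassifier(random_state=1)
# model.fit(X, y)
```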

Besides, feel free to tell me if you run into any problems when trying out this solution ;-). We are willing to further improve the functionality of DF for such large datasets.

@xuyxu xuyxu added enhancement Miscellaneous improvements and removed wontfix This will not be worked on labels Jun 2, 2021
@JiaLeXian
Author

@xuyxu thanks for the quick reply. Thanks for suggesting numpy.memmap. We will try this option in the following days. Will keep you posted. Thank you!
