Clustering does not accept input from CountVectorizer or TfidfVectorizer #5807
Comments
Thanks for the issue, @erico-imgproj! I notice that you're using [...] We'll also work on solving the issue you're seeing with scikit-learn's CountVectorizer; it should work as well.
Hi @dantegd, I tested both, but the error is still there. The lines up to the generation of the features come from an example available on the cuML website, so it should work. The issue seems to be that the output of neither CountVectorizer nor TfidfVectorizer is recognized by the clustering algorithms. Classification tasks run fine on the same output.
This also happens for cuML's RandomForest. However, models like Naive Bayes and SVC do work in the same setup. Is there a specific reason why RF can't deal with the csr_matrix?
import time
import cudf
import cupy as cp
import numpy as np
# from xgboost import XGBClassifier
from cuml.dask.common import to_sparse_dask_array
from cuml.ensemble import RandomForestClassifier
# from dask_ml.feature_extraction.text import HashingVectorizer
from cuml.feature_extraction.text import CountVectorizer, HashingVectorizer
# from cuml.dask.naive_bayes import MultinomialNB as cuNB
from cuml.naive_bayes import MultinomialNB as cuNB
from cuml.svm import SVC as cuSVC
from cupyx.scipy.sparse import csr_matrix
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from sklearn.datasets import fetch_20newsgroups
# Create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)
# Load corpus
twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
twenty_train = cudf.DataFrame.from_dict(
    {"data": twenty_train.data, "target": twenty_train.target}
)
cv = HashingVectorizer()
xformed = cv.fit_transform(twenty_train.data).astype(np.float32)
X = csr_matrix(xformed).astype(cp.float32)
y = cp.asarray(twenty_train.target).astype(cp.int32)
from cuml.ensemble import RandomForestClassifier as cuRF
# Try NB
model = cuNB()
start = time.time()
model.fit(X, y) # works
end = time.time()
print("Time to train: ", end - start)
# Try RF
model = cuRF()
start = time.time()
model.fit(X, y) # fails
end = time.time()
print("Time to train: ", end - start)
I get the same errors as @erico-imgproj.
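A possible stopgap while sparse input is rejected is to pass a dense copy instead, but only when that copy fits in memory. The sketch below is an assumption, not something confirmed in this thread: `dense_nbytes` and `densify_if_small` are hypothetical helper names, and it runs on the CPU with SciPy so it can be tried without a GPU (cupyx sparse matrices expose a similar `toarray()` method). The last line sizes the repro above under the assumption of 11314 training documents and 2**20 hashed columns (scikit-learn's default for HashingVectorizer) in float32.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def dense_nbytes(X):
    """Bytes needed to hold X as a dense array of the same dtype."""
    n_rows, n_cols = X.shape
    return n_rows * n_cols * X.dtype.itemsize


def densify_if_small(X, max_bytes=2 * 1024**3):
    """Return a dense copy of sparse X, or raise if it would exceed max_bytes."""
    if not issparse(X):
        return np.asarray(X)
    needed = dense_nbytes(X)
    if needed > max_bytes:
        raise MemoryError(f"dense copy needs {needed} bytes, limit is {max_bytes}")
    return X.toarray()


# Tiny stand-in for a vectorizer's sparse output
X_sparse = csr_matrix(np.array([[0, 2, 0], [1, 0, 0]], dtype=np.float32))
X_dense = densify_if_small(X_sparse)
print(type(X_dense).__name__, X_dense.shape)

# Assumed sizing of the repro: documents * hashed columns * bytes per float32
print(11314 * 2**20 * 4 / 1e9, "GB")
```

Under those assumptions the dense copy would need tens of gigabytes, so densifying is only realistic with a much narrower vectorizer (for example a smaller `n_features`) or a reduced feature set.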
Describe the bug
NLP clustering does not work properly. The code available in the example works fine for classification tasks, but the clustering algorithms do not accept the output of classes like CountVectorizer or TfidfVectorizer.
This error also occurs when running PCA on the output of CountVectorizer or TfidfVectorizer.
Steps/Code to reproduce bug
The code above works fine, but when the clustering algorithm is called it returns the following error
In the case of PCA, the code added before the clustering task is the following
and the error generated is
Expected behavior
The clustering algorithm should return the cluster labels as a list, and PCA should return a tabular data set for post-processing.
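As a CPU reference point for the behavior described here, scikit-learn's estimators accept the vectorizer's sparse output directly. This is a minimal sketch on a made-up four-document corpus, not the reporter's code, and it swaps in TruncatedSVD for PCA because TruncatedSVD is the scikit-learn decomposition designed for sparse input.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only
docs = [
    "gpu kernels run matrix multiplication",
    "gpu memory holds the sparse matrix",
    "the cat sleeps on the sofa",
    "the dog and the cat share the sofa",
]

# The vectorizer returns a SciPy sparse matrix
X = TfidfVectorizer().fit_transform(docs)

# Clustering consumes the sparse matrix and yields one label per document
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X).tolist()

# Dimensionality reduction yields a dense (n_documents, n_components) array
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X)

print(labels)
print(reduced.shape)
```

The same two outputs, a list of labels and a dense table of reduced features, are what the report expects from the GPU estimators.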
Environment details (please complete the following information):
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster
Additional context
This error is related to the one first mentioned in #5805.