Clustering problems with NLP and CUML #5805

Open
erico-imgproj opened this issue Mar 15, 2024 · 2 comments
Labels
? - Needs Triage (Need team to review and classify), question (Further information is requested)

Comments

@erico-imgproj

What is your question?
While processing a large NLP dataset, I found a very good example on the cuML documentation site. Following its instructions, I wrote my own version for my dataset, which contains 6 million phrases. I want to run a clustering algorithm on it to begin testing.

import dask
dask.config.set(**{'array.slicing.split_large_chunks': True})
import cupy as cp
import cudf
from cuml.dask.common import to_sparse_dask_array
from cuml.feature_extraction.text import CountVectorizer
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
n_workers = 4
cluster.scale(n_workers)

workers = client.has_what().keys()

fulldf = cudf.read_parquet('phrases.parquet')
fulldf = fulldf[~fulldf.origphrase.isna()]
fulldf.origphrase = fulldf.origphrase.astype(str)

cv = CountVectorizer()
X_tfidf = cv.fit_transform(fulldf['origphrase']).astype(cp.float32)
X = to_sparse_dask_array(X_tfidf, client)


from cuml.dask.cluster import KMeans

kmeans_float = KMeans(n_clusters=51)
yhat = kmeans_float.fit_predict(X)  # Line where the error happens

After preprocessing the data, the X variable is of type

dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>

which is the same type that the RAPIDS example shows. Unfortunately, I get an error when I pass it to the KMeans model:

2024-03-15 12:49:50,008 - distributed.worker - WARNING - Compute Failed
Key:       _func_fit-3f8c5c07-bc70-4821-9613-bc8545faf086
Function:  _func_fit
args:      (b'\x99\xab\xfbBR\xf5M\xd1\x91v\x1f9\x0et\xaa\xd3', [<cupyx.scipy.sparse._csr.csr_matrix object at 0x7f134f8825f0>], 'cupy', False)
kwargs:    {'n_clusters': 51, 'verbose': False}
Exception: 'AttributeError("\'NoneType\' object has no attribute \'shape\'")'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 198, in fit_predict
    return self.fit(X, sample_weight=sample_weight).predict(
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 175, in fit
    wait_and_raise_from_futures(kmeans_fit)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 164, in wait_and_raise_from_futures
    raise_exception_from_futures(futures)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 152, in raise_exception_from_futures
    raise RuntimeError(
RuntimeError: 1 of 1 worker jobs failed: 'NoneType' object has no attribute 'shape'
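
For reference, the repr of X above reports a chunksize equal to the shape, so the whole matrix sits in a single chunk, which appears consistent with the message mentioning 1 of 1 worker jobs. The following is a minimal arithmetic sketch (plain Python, no GPU or Dask required) of how the number of blocks follows from the shape and the chunk size. `num_blocks` is a hypothetical helper, and the four-way row split is only an illustrative assumption, not a confirmed fix for the error.

```python
def num_blocks(shape, chunksize):
    """Return how many blocks each axis is split into (hypothetical helper)."""
    # -(-a // b) is integer ceiling division
    return tuple(-(-dim // chunk) for dim, chunk in zip(shape, chunksize))

shape = (6261516, 232309)  # shape reported in the repr of X

# Chunk size as reported in the repr: identical to the shape
single = num_blocks(shape, (6261516, 232309))

# Illustrative assumption: rows split evenly across 4 workers
rows_per_chunk = -(-shape[0] // 4)
split = num_blocks(shape, (rows_per_chunk, shape[1]))

print(single)  # blocks per axis with the reported chunking
print(split)   # blocks per axis with the assumed 4-way row split
```

With the reported chunking there is exactly one block, so only one worker can hold data for the fit.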

If I try to run yhat = kmeans_float.fit_predict(X.compute()) instead, the error changes to:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 198, in fit_predict
    return self.fit(X, sample_weight=sample_weight).predict(
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 154, in fit
    data = DistributedDataHandler.create(inputs, client=self.client)
  File "/home/erico/lab/packages_dask/cuml/dask/common/input_utils.py", line 108, in create
    datatype, multiple = _get_datatype_from_inputs(data)
  File "/home/erico/lab/packages_dask/cuml/dask/common/input_utils.py", line 193, in _get_datatype_from_inputs
    validate_dask_array(data)
  File "/home/erico/lab/packages_dask/cuml/dask/common/dask_arr_utils.py", line 34, in validate_dask_array
    if len(darray.chunks) > 2:
AttributeError: 'csr_matrix' object has no attribute 'chunks'
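
The second traceback ends in a check of `darray.chunks`. Calling `.compute()` on a Dask array returns the concrete in-memory object (here a csr_matrix), which has no `chunks` attribute, so the check fails before any fitting starts. Below is a minimal CPU-only sketch of that mechanism. `FakeDaskArray`, `FakeCsrMatrix`, and `validate_chunks` are hypothetical stand-ins written for illustration; only the `len(darray.chunks) > 2` condition is taken from the traceback.

```python
class FakeCsrMatrix:
    """Hypothetical stand-in for a concrete sparse matrix (no .chunks)."""
    def __init__(self, shape):
        self.shape = shape

class FakeDaskArray:
    """Hypothetical stand-in for a lazy, chunked array."""
    def __init__(self, shape):
        self.shape = shape
        self.chunks = tuple((dim,) for dim in shape)  # one chunk per axis

    def compute(self):
        # Materializing returns the concrete object, without chunk metadata.
        return FakeCsrMatrix(self.shape)

def validate_chunks(darray):
    # Same condition as the line shown in the traceback; the message is made up.
    if len(darray.chunks) > 2:
        raise ValueError("more than 2 dimensions")

X = FakeDaskArray((6261516, 232309))
validate_chunks(X)  # passes: the lazy array carries .chunks

try:
    validate_chunks(X.compute())
    failed = False
    message = ""
except AttributeError as exc:
    failed = True
    message = str(exc)

print(failed, message)
```

Read this way, the second error looks like a consequence of handing a non-Dask object to a `cuml.dask` estimator, and is probably separate from the first one.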

Changing the clustering algorithm also does not help. For instance, I tried the following code:

from cuml.dask.cluster import DBSCAN
model = DBSCAN(min_samples=5, gen_min_span_tree=True)
yhat = model.fit_predict(X)

and I get this error:

Key:       _func-1d45cd66-2d8a-4525-9c2e-f576d861a7c5
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

2024-03-15 12:54:36,754 - distributed.worker - WARNING - Compute Failed
Key:       _func-092bbae9-6a71-4b2e-b670-445edd0005e9
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

2024-03-15 12:54:36,757 - distributed.worker - WARNING - Compute Failed
Key:       _func-9cc5519b-dda6-4208-8e76-a1a6f345c4ad
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/dbscan.py", line 160, in fit_predict
    self.fit(X, out_dtype)
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/dbscan.py", line 133, in fit
    wait_and_raise_from_futures(dbscan_fit)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 164, in wait_and_raise_from_futures
    raise_exception_from_futures(futures)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 152, in raise_exception_from_futures
    raise RuntimeError(
RuntimeError: 4 of 4 worker jobs failed: setting an array element with a sequence., setting an array element with a sequence., setting an array element with a sequence., setting an array element with a sequence.

Any help is appreciated

@dantegd
Member

dantegd commented Mar 20, 2024

Thanks for the issue! I'm not entirely sure what's happening. Any chance you could run the script https://github.com/rapidsai/cuml/blob/branch-24.04/print_env.sh and post the output? Seeing which versions of cuml, dask, etc. you have would be very useful for reproducing this.

@erico-imgproj
Author

Hello

Here is my configuration

cubinlinker-cu11          0.3.0.post1
cucim-cu11                24.2.0
cuda-python               11.8.3
cudf-cu11                 24.2.2
cugraph-cu11              24.2.0
cuml-cu11                 24.2.0
cuproj-cu11               24.2.0
cupy-cuda11x              13.0.0
cuspatial-cu11            24.2.0
cuxfilter-cu11            24.2.0
dask                      2024.1.1
dask-cuda                 24.2.0
dask-cudf-cu11            24.2.2
dask-glm                  0.3.2
dask-ml                   2023.3.24
raft-dask-cu11            24.2.0
rapids-dask-dependency    24.2.0

I hope it helps
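
For anyone reproducing this, a list like the one above can also be gathered from Python with the standard library. This is a minimal sketch and not a replacement for the print_env.sh script mentioned earlier. The package names are copied from the list above, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(names):
    """Map each distribution name to its version, or None if it is not installed."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result

# Names copied from the configuration list above
packages = [
    "cuml-cu11",
    "cudf-cu11",
    "cupy-cuda11x",
    "dask",
    "dask-cuda",
    "raft-dask-cu11",
]

for name, ver in installed_versions(packages).items():
    print(f"{name:<26}{ver or 'not installed'}")
```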
