[BUG] cuml cannot split dataframe with string column. 5676 #5834

AndreasKarasenko · 2024-04-08T12:07:55Z

Describe the bug
When using cuml's train_test_split on a cudf DataFrame with a string column, it fails with "TypeError: String Arrays is not yet implemented in cudf". The corresponding cudf issue refers back to cuml, and there seems to be no news (see here).

Using sklearn or dask-ml for splitting works as expected, so I'm not sure if it even is a cudf issue.

Steps/Code to reproduce bug
This code uses a mix of the HPO example and the Naive Bayes example.

import cudf
import cuml
import cupy as cp
import numpy as np
import pandas as pd

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from sklearn.datasets import fetch_20newsgroups

# from dask_ml.model_selection import train_test_split
from cuml.model_selection import train_test_split
# from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Load corpus
    twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
    df = pd.DataFrame(
        data={"text": twenty_train.data, "label": twenty_train.target},
        columns=["text", "label"],
    )
    df = cudf.DataFrame.from_pandas(df)

    # Fails with "TypeError: String Arrays is not yet implemented in cudf"
    X_train, X_test, y_train, y_test = train_test_split(df, "label", test_size=0.2)
    # The equivalent call works with sklearn or dask_ml:
    # X_train, X_test, y_train, y_test = train_test_split(
    #     df.text, df.label, test_size=0.2, shuffle=False
    # )

Expected behavior
It should split the dataframe like sklearn or dask.
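
To make the expected behavior concrete, here is a minimal sketch of a split that selects rows purely by position, so the string column is never converted to a numeric array. The helper names `split_positions` and `split_frame` are hypothetical and not part of cuml or cudf. The sketch assumes the DataFrame supports `len()`, `.drop(columns=...)`, and `.iloc` with a list of integer positions, which pandas does and which cudf is expected to do as well. It has not been verified against cuml 24.02.

```python
import random


def split_positions(n_rows, test_size=0.2, shuffle=True, seed=None):
    """Return (train_positions, test_positions) as lists of row positions."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction between 0 and 1")
    positions = list(range(n_rows))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_size))
    n_train = n_rows - n_test
    return positions[:n_train], positions[n_train:]


def split_frame(df, label, test_size=0.2, shuffle=True, seed=None):
    """Split a DataFrame into X_train, X_test, y_train, y_test.

    Hypothetical helper: rows are picked with .iloc only, so columns of
    any dtype (including strings) are passed through unchanged.
    """
    train_pos, test_pos = split_positions(len(df), test_size, shuffle, seed)
    X = df.drop(columns=[label])
    y = df[label]
    return X.iloc[train_pos], X.iloc[test_pos], y.iloc[train_pos], y.iloc[test_pos]
```

With the DataFrame from the reproducer, this would be called as `split_frame(df, "label", test_size=0.2)`. Building the positions on the host with Python's `random` module is a simplification for illustration; a GPU-side permutation would be preferable for large frames.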

Environment details:

WSL2
Version: 24.02.00

dantegd · 2024-04-30T05:19:39Z

If I'm not mistaken, we currently don't actively test this method with string columns, since we didn't expect string columns when we originally added it to cuML. That said, we will check whether this is expected behavior or a bug and update the issue. Thanks @AndreasKarasenko!

vyasr · 2024-05-17T18:18:24Z

I was doing some cudf issue triage and came across the linked issue rapidsai/cudf#12989. Please take a look at my last comment, it might provide some insight into what caused this incompatibility in cuml (it looks like this used to work in 23.02 so maybe cuml started converting to a cupy array internally during 23.04 development).

AndreasKarasenko added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Apr 8, 2024

vyasr mentioned this issue May 17, 2024

[BUG]TypeError: String Arrays is not yet implemented in cudf rapidsai/cudf#12989

Closed
