[BUG] cuml cannot split dataframe with string column. 5676 #5834

Open
AndreasKarasenko opened this issue Apr 8, 2024 · 2 comments
Labels
? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

@AndreasKarasenko

Describe the bug
When using cuml's train_test_split on a cudf DataFrame with a string column, it fails with "TypeError: String Arrays is not yet implemented in cudf". cudf refers back to cuml, and there seems to be no news (see here).

Using sklearn or dask-ml for splitting works as expected, so I'm not sure if it even is a cudf issue.

Steps/Code to reproduce bug
This code uses a mix of the HPO example and the Naive Bayes example.

import cudf
import cuml
import cupy as cp
import numpy as np
import pandas as pd

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from sklearn.datasets import fetch_20newsgroups

# from dask_ml.model_selection import train_test_split
from cuml.model_selection import train_test_split
# from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Load corpus
    twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
    df = pd.DataFrame(
        data={"text": twenty_train.data, "label": twenty_train.target},
        columns=["text", "label"],
    )
    df = cudf.DataFrame.from_pandas(df)

    X_train, X_test, y_train, y_test = train_test_split(df, "label", test_size=0.2) # does not work
    # works with sklearn or dask_ml
    # X_train, X_test, y_train, y_test = train_test_split(
    #     df.text, df.label, test_size=0.2, shuffle=False
    # )

Expected behavior
It should split the dataframe like sklearn or dask.
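
Until this is resolved, one possible workaround is to split by row position instead of passing the frame to cuml. The following is a minimal sketch, not cuml's implementation: `positional_split` and `split_frame` are hypothetical helper names, and the sketch assumes that `len()`, `.iloc[...]` with a list of integers, column selection, and `.drop(columns=...)` behave on a cudf DataFrame as they do in pandas. It has not been verified against cudf 24.02. Because no column is converted to an array, the string column is never touched.

```python
import random


def positional_split(n_rows, test_size=0.2, seed=None):
    """Return (train_positions, test_positions) as lists of row positions."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    positions = list(range(n_rows))
    # Shuffle positions with a private RNG so the global random state is untouched.
    random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_size))
    return positions[n_test:], positions[:n_test]


def split_frame(frame, target, test_size=0.2, seed=None):
    """Split a DataFrame-like object by row position only.

    Assumption: `frame` supports len(), .iloc[list_of_ints], frame[column]
    and .drop(columns=[...]). This holds for pandas and is assumed for cudf.
    """
    train_pos, test_pos = positional_split(len(frame), test_size, seed)
    train, test = frame.iloc[train_pos], frame.iloc[test_pos]
    # Return order mirrors the X_train, X_test, y_train, y_test convention.
    return (
        train.drop(columns=[target]),
        test.drop(columns=[target]),
        train[target],
        test[target],
    )
```

With the reproducer above, the call would be `X_train, X_test, y_train, y_test = split_frame(df, "label", test_size=0.2, seed=42)`. Note that the test-set size is rounded here, so it may differ by one row from what sklearn or cuml would produce.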

Environment details:

  • WSL2
  • Version: 24.02.00
@dantegd
Member

dantegd commented Apr 30, 2024

Currently we don't actively test for string columns in this method, if I'm not mistaken, since back when we originally added it to cuML we didn't expect string columns. That said, we will check whether this is expected or a bug and update the issue. Thanks @AndreasKarasenko!

@vyasr
Contributor

vyasr commented May 17, 2024

I was doing some cudf issue triage and came across the linked issue, rapidsai/cudf#12989. Please take a look at my last comment there; it might provide some insight into what caused this incompatibility in cuml. It looks like this used to work in 23.02, so maybe cuml started converting to a cupy array internally during 23.04 development.
