Problem with datasets with categorical attributes #1228

Open
janvanrijn opened this issue Mar 16, 2023 · 6 comments

@janvanrijn
Member

The following code (taken from the examples) crashes when applied to datasets with categorical attributes.

@mfeurer @prabhant @PGijsbers

import openml
from sklearn import impute, tree, pipeline

# Define a scikit-learn classifier or pipeline
clf = pipeline.Pipeline(
    steps=[
        ('imputer', impute.SimpleImputer(strategy='constant', fill_value=-1)),
        ('estimator', tree.DecisionTreeClassifier())
    ]
)
openml.config.server = 'https://test.openml.org/api/v1/'
openml.config.apikey = 'removed'

# Download an OpenML task with 10-fold cross-validation. The original example
# used the German credit dataset; task 1 is anneal, which has categorical attributes.
task = openml.tasks.get_task(1)
# Run the scikit-learn model on the task.
run = openml.runs.run_model_on_task(clf, task)

@mfeurer
Collaborator

mfeurer commented Mar 17, 2023

Could you please provide us with the output of

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import openml; print("OpenML", openml.__version__)

so we know the versions of scikit-learn and OpenML-Python?

@PGijsbers
Collaborator

PGijsbers commented Mar 17, 2023

+1 for the version info; I cannot reproduce a crash locally on a fresh install. Wait, are you talking about the scikit-learn error raised by the line run = openml.runs.run_model_on_task(clf, task), i.e. ValueError: could not convert string to float: 'ZS'? That happens because you changed the dataset from the example. The provided scikit-learn pipeline cannot handle string data; it would need an encoder for that.
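
The failure described here can be reproduced without OpenML. The following is a minimal sketch, assuming a toy object-dtype array in place of the real task data (the labels and targets are made up for illustration): a bare decision tree rejects string labels, while the same tree placed behind an OrdinalEncoder fits.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for a categorical column; values and targets are made up.
X = np.array([["ZS"], ["A"], ["ZS"], ["A"]], dtype=object)
y = np.array([1, 0, 1, 0])

# Without an encoder the tree tries to cast the labels to float and fails.
try:
    DecisionTreeClassifier().fit(X, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With an encoder in front, the labels are turned into numeric codes first.
encoded_clf = Pipeline(
    steps=[
        ("encoder", OrdinalEncoder()),
        ("estimator", DecisionTreeClassifier(random_state=0)),
    ]
)
encoded_clf.fit(X, y)
predictions = encoded_clf.predict(X)
```

An ordinal encoding is used only to keep the sketch short; for nominal attributes a one-hot encoding is usually the safer choice, since it does not impose an order on the categories.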

@PGijsbers
Collaborator

Sidenote: I noticed that task 32 is not actually credit-g (opened as separate issue #1229).

@janvanrijn
Member Author

Here is the version info:

Linux-5.19.0-35-generic-x86_64-with-glibc2.35
Python 3.10.9 (main, Mar  8 2023, 10:47:38) [GCC 11.2.0]
NumPy 1.23.5
SciPy 1.10.0
Scikit-Learn 1.2.2
OpenML 0.13.0

This is indeed the error: ValueError: could not convert string to float: 'ZS'. Note that this is not a string value but a categorical value. AFAIK this is not dataset-specific: I had similar issues on the live server with the OpenML-CC18, and when I use task 7 on the test server (kr-vs-kp) I get a similar error: ValueError: could not convert string to float: 'f'.
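
On the string-versus-categorical point, a small sketch may help, assuming the attribute reaches the estimator as a pandas column with a categorical dtype (the column below is hypothetical and only mirrors the label from the error message). The dtype stores integer codes, but the values it exposes are still the string labels, so a float conversion fails in the same way.

```python
import pandas as pd

# Hypothetical column; the labels mirror the ones in the error messages.
column = pd.Series(["ZS", "A", "ZS"], dtype="category")

# The categorical dtype keeps integer codes plus a table of labels.
codes = column.cat.codes.tolist()
labels = column.cat.categories.tolist()

# Element access hands back the label itself, which is a plain str.
first_value = column.iloc[0]

# Converting that label to float raises the error quoted in this thread.
try:
    float(first_value)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```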

I know that one-hot encoding is preferred, but in the past it also worked without it. The error also occurs, for example, when applying imputation first and then one-hot encoding.
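
For the impute-then-encode order, one arrangement that should work is sketched below. It assumes a toy object-dtype column with a missing entry (made-up data) and fills the gap with a string placeholder, so the column stays purely categorical before it is one-hot encoded. The placeholder name "missing" is chosen for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy categorical column with one missing entry; data is made up.
X = np.array([["f"], ["t"], [np.nan], ["t"]], dtype=object)
y = np.array([0, 1, 0, 1])

impute_then_encode = Pipeline(
    steps=[
        # Fill gaps with a string placeholder so all values remain labels.
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        # Turn every label, including the placeholder, into indicator columns.
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("estimator", DecisionTreeClassifier(random_state=0)),
    ]
)
impute_then_encode.fit(X, y)
predictions = impute_then_encode.predict(X)
```

On a real task the numeric columns would need their own branch (for instance via a ColumnTransformer), since a string placeholder is not appropriate for them.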

@PGijsbers
Collaborator

There are also examples that work with categorical data, e.g., this pipeline from the docs. Is it possible you mixed them up? As far as I am aware, openml-python never did any imputation or encoding itself, so the only explanation would be that scikit-learn changed (though I'm not aware of any scikit-learn changes that would explain this).

Example for running a pipeline on kr-vs-kp:

import openml
from sklearn import pipeline, compose, preprocessing, impute, tree

# OpenML helper functions for sklearn can be plugged in directly for complicated pipelines
from openml.extensions.sklearn import cat, cont


openml.config.start_using_configuration_for_example()

task = openml.tasks.get_task(7)

pipe = pipeline.Pipeline(
    steps=[
        (
            "Preprocessing",
            compose.ColumnTransformer(
                [
                    (
                        "categorical",
                        # sparse_output replaces sparse from scikit-learn 1.2 on
                        preprocessing.OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
                        cat,  # returns the categorical feature indices
                    ),
                    (
                        "continuous",
                        impute.SimpleImputer(strategy="median"),
                        cont,  # returns the numeric feature indices
                    ),
                ]
            ),
        ),
        ("Classifier", tree.DecisionTreeClassifier()),
    ]
)

run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)
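
The dense-output flag of the one-hot encoder in the pipeline above was renamed in scikit-learn 1.2 (sparse became sparse_output), so which spelling is accepted depends on the installed version. A minimal sketch of one way to stay compatible across versions, using a hypothetical helper make_dense_encoder that is not part of scikit-learn or openml-python:

```python
import inspect

import numpy as np
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder():
    """Build a OneHotEncoder with dense output on old and new scikit-learn."""
    # Inspect the constructor to see which keyword this version accepts.
    params = inspect.signature(OneHotEncoder).parameters
    dense_flag = "sparse_output" if "sparse_output" in params else "sparse"
    return OneHotEncoder(handle_unknown="ignore", **{dense_flag: False})


# Made-up labels, only to show that the result is a dense array.
encoder = make_dense_encoder()
encoded = encoder.fit_transform(np.array([["f"], ["t"]], dtype=object))
```

The returned encoder can be dropped into the "categorical" branch of the ColumnTransformer in place of the direct constructor call.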

@LennartPurucker
Contributor

I think @PGijsbers's statement and code are a potential solution to this issue.
Do you know if this resolved your problem, @janvanrijn? Or is this still a problem with openml-python that I could look into?
