Problem with datasets with categorical attributes #1228

janvanrijn · 2023-03-16T22:31:22Z

The following code, taken from the examples, crashes when applied to datasets with categorical attributes:

import openml
from sklearn import impute, tree, pipeline

# Define a scikit-learn classifier or pipeline
clf = pipeline.Pipeline(
    steps=[
        ('imputer', impute.SimpleImputer(strategy='constant', fill_value=-1)),
        ('estimator', tree.DecisionTreeClassifier())
    ]
)
openml.config.server = 'https://test.openml.org/api/v1/'
openml.config.apikey = 'removed'

# Download the OpenML task. The original example comment refers to the German
# credit dataset; task 1 is the anneal dataset, which has categorical attributes.
task = openml.tasks.get_task(1)
# Run the scikit-learn model on the task.
run = openml.runs.run_model_on_task(clf, task)

mfeurer · 2023-03-17T07:45:19Z

Could you please provide us with the output of

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import openml; print("OpenML", openml.__version__)

so we know the versions of scikit-learn and OpenML-Python?

PGijsbers · 2023-03-17T10:19:34Z

~~+1 for info, cannot reproduce this locally on a fresh install.~~ Wait. Are you talking about the scikit-learn error raised by the line run = openml.runs.run_model_on_task(clf, task), i.e. ValueError: could not convert string to float: 'ZS'? That happens because you changed the dataset from the example. The provided scikit-learn pipeline cannot handle string data; it would need an encoder for that.
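A minimal sketch of this point, using a toy DataFrame instead of an OpenML task so it runs offline. The column names and values are made up for illustration, and the encoder setup is only one of several possible fixes:

```python
import pandas as pd
from sklearn import compose, pipeline, preprocessing, tree

# Toy stand-in for a dataset with one categorical and one numeric attribute.
X = pd.DataFrame(
    {
        "family": pd.Categorical(["ZS", "TN", "ZS", "TN", "ZS", "TN"]),
        "thick": [0.7, 1.2, 0.6, 0.8, 1.0, 3.2],
    }
)
y = [0, 1, 0, 1, 0, 1]

# Without an encoder, the tree tries to cast the category labels to float.
error_message = None
try:
    tree.DecisionTreeClassifier().fit(X, y)
except ValueError as exc:
    error_message = str(exc)
print("Without encoder:", error_message)

# With an encoder on the categorical columns, the same data can be fitted.
encoded = pipeline.Pipeline(
    steps=[
        (
            "preprocessing",
            compose.ColumnTransformer(
                [
                    (
                        "categorical",
                        preprocessing.OneHotEncoder(handle_unknown="ignore"),
                        # Selects the columns whose pandas dtype is "category".
                        compose.make_column_selector(dtype_include="category"),
                    ),
                ],
                remainder="passthrough",  # numeric columns are passed on unchanged
            ),
        ),
        ("estimator", tree.DecisionTreeClassifier(random_state=0)),
    ]
)
encoded.fit(X, y)
predictions = encoded.predict(X)
print("With encoder:", list(predictions))
```

The selector here relies on the categorical columns having the pandas category dtype; data loaded in another form would need a different column selection.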

PGijsbers · 2023-03-17T10:28:49Z

Sidenote: I noticed that task 32 is not actually credit-g (opened as separate issue #1229).

janvanrijn · 2023-03-17T14:59:22Z

Here is the version info:

Linux-5.19.0-35-generic-x86_64-with-glibc2.35
Python 3.10.9 (main, Mar  8 2023, 10:47:38) [GCC 11.2.0]
NumPy 1.23.5
SciPy 1.10.0
Scikit-Learn 1.2.2
OpenML 0.13.0

This is indeed the error ValueError: could not convert string to float: 'ZS'. Note that this is not a string value, but a categorical value. AFAIK this is not dataset specific. I had similar issues on the live server with the OpenML-CC18. When I use task 7 on the test server (kr-vs-kp), I get a similar error: ValueError: could not convert string to float: 'f'.
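For what it is worth, the error message mentions a string even for categorical data. A small sketch with a made-up column (not taken from any OpenML dataset) suggests why: converting a pandas categorical column to a NumPy array yields its labels, and it is those labels that fail the float conversion.

```python
import numpy as np
import pandas as pd

# A categorical column with string labels, similar in spirit to kr-vs-kp.
col = pd.Series(["f", "t", "f"], dtype="category")

# Converting to NumPy gives the labels themselves, not numeric codes.
as_array = np.asarray(col)
print(as_array)

# Casting those labels to float raises the error seen in the thread.
message = None
try:
    as_array.astype(float)
except ValueError as exc:
    message = str(exc)
print(message)

# The integer codes exist, but they have to be requested explicitly.
codes = col.cat.codes.to_numpy()
print(codes)
```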

I know that one-hot encoding is preferred, but in the past it also worked without it. The error also occurs when, for example, imputation is applied first and one-hot encoding afterwards.

PGijsbers · 2023-03-17T15:14:14Z

There are also examples which work with categorical data, e.g., this pipeline from the docs. Is it possible you mixed them up? As far as I am aware, openml-python never did any imputation or encoding itself, so the only explanation would be that scikit-learn changed (though I'm not aware of any change in scikit-learn that would explain this).

Example for running a pipeline on kr-vs-kp:

import openml
from sklearn import pipeline, compose, preprocessing, impute, tree

# OpenML helper functions for sklearn can be plugged in directly for complicated pipelines
from openml.extensions.sklearn import cat, cont


openml.config.start_using_configuration_for_example()

task = openml.tasks.get_task(7)

pipe = pipeline.Pipeline(
    steps=[
        (
            "Preprocessing",
            compose.ColumnTransformer(
                [
                    (
                        "categorical",
                        # scikit-learn >= 1.2 names this argument sparse_output;
                        # older releases call it sparse.
                        preprocessing.OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
                        cat,  # returns the categorical feature indices
                    ),
                    (
                        "continuous",
                        impute.SimpleImputer(strategy="median"),
                        cont,  # returns the numeric feature indices
                    ),
                ]
            ),
        ),
        ("Classifier", tree.DecisionTreeClassifier()),
    ]
)

run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)

LennartPurucker · 2023-04-16T12:28:41Z

I think @PGijsbers's statement and code are a potential solution to this issue.
Do you know if this resolved your problem, @janvanrijn? Or is this still a problem with openml-python that I could look into?
