'Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'> #498

Open
yzheng27 opened this issue Feb 1, 2022 · 7 comments

Comments

@yzheng27

yzheng27 commented Feb 1, 2022

I was following the example https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb with my own data and XGBoost model, but I get the error ('Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'>) at explainer.explain_global(x_test). Converting x_test to a DMatrix instead raises the error 'DMatrix' object has no attribute 'shape'. Please advise. Thank you.

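from sklearn.model_selection import train_test_split
from interpret.ext.blackbox import TabularExplainer

x_train, x_test, y_train, y_test = train_test_split(df[features], df[LABEL], test_size=0.2, random_state=0)

explainer = TabularExplainer(model,
                             x_train,
                             model_task='regression',
                             features=features)
# Fails with: 'Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'>
global_explanation = explainer.explain_global(x_test)
# Second attempt, fails with: 'DMatrix' object has no attribute 'shape'
# xgtest = xgb.DMatrix(x_test.values)
# global_explanation = explainer.explain_global(xgtest)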

Version:
interpret-community==0.23.0
interpret-core==0.2.7
xgboost==1.4.1

@gaugup
Collaborator

gaugup commented Feb 2, 2022

@yzheng27 Thanks for reporting the issue. Could you try the latest interpret-community release, 0.24.2, and see whether the issue persists? If it does, could you provide a sample notebook so that we can reproduce it locally? A stack trace of the error would also help us greatly in triaging this issue.

Regards,

@imatiach-msft
Collaborator

imatiach-msft commented Feb 2, 2022

@gaugup I think the issue is happening because they are using the native XGBoost API, which uses DMatrix, instead of the scikit-learn XGBoost API, which is pandas compatible, so I'm guessing that upgrading to the latest version won't fix it. @yzheng27 I will take a look to see if we can support DMatrix from XGBoost somehow, but an easy quick fix would be to use the scikit-learn API for XGBoost.
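A minimal sketch of that quick fix, assuming the model was saved to a file with the native API. The helper name and `model_path` are hypothetical and not from the thread. `XGBRegressor.load_model` belongs to XGBoost's scikit-learn wrapper, but how completely a natively trained booster round-trips through it can depend on the XGBoost version, so treat this as a starting point rather than a confirmed recipe:

```python
def load_as_sklearn_regressor(model_path):
    """Load a saved XGBoost model through the scikit-learn wrapper.

    The wrapper's predict() accepts pandas DataFrames directly, so the
    caller never has to build a DMatrix by hand.
    """
    # Imported inside the function so the helper can be defined even
    # where xgboost is not installed.
    import xgboost as xgb

    model = xgb.XGBRegressor()
    model.load_model(model_path)  # model_path is a hypothetical file name
    return model


# Usage, with x_train, x_test and features as in the original post:
#   from interpret.ext.blackbox import TabularExplainer
#   model = load_as_sklearn_regressor("model.json")
#   explainer = TabularExplainer(model, x_train, model_task="regression",
#                                features=features)
#   global_explanation = explainer.explain_global(x_test)
```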

@yzheng27
Author

yzheng27 commented Feb 2, 2022

Thank you. I was able to generate the global_explanation by loading the model with the scikit-learn interface. But now my notebook has been running the code below for several hours. Is this expected? The shape of x_test is around 24000*325.

ExplanationDashboard(global_explanation, model, dataset=x_test, true_y=y_test, public_ip = host, port = 7780)

@imatiach-msft
Collaborator

"the shape of x_test is around 24000*325"
@yzheng27 yes, that may be too large for the UI to handle. Please limit it by downsampling to ~5k rows instead of 24k. If you still see issues with the downsampled data, then there might be a problem with the host/port configuration. However, even then you should still see the dashboard; only what-if analysis and ICE plots won't work in the ExplanationDashboard.
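One way to do the downsampling, as a sketch: pick a reproducible random set of row positions and apply it to both the features and the labels so they stay aligned. The helper name `sample_row_positions` and the seed are illustrative assumptions; the 5000-row default follows the ~5k suggestion above.

```python
import random


def sample_row_positions(n_rows, max_rows=5000, seed=0):
    """Return sorted row positions for a reproducible random subsample."""
    if n_rows <= max_rows:
        # Nothing to downsample: keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), max_rows))


# Usage with the pandas objects from the thread:
#   positions = sample_row_positions(len(x_test))
#   x_small = x_test.iloc[positions]
#   y_small = y_test.iloc[positions]
#   global_explanation = explainer.explain_global(x_small)
#   ExplanationDashboard(global_explanation, model,
#                        dataset=x_small, true_y=y_small)
```

The explanation is recomputed on the sampled rows in this sketch so that it matches the dataset passed to the dashboard.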

@imatiach-msft
Collaborator

imatiach-msft commented Feb 3, 2022

@yzheng27 one other thing: are you importing the dashboard from the raiwidgets package, from this repository?

from raiwidgets import ExplanationDashboard

https://github.com/microsoft/responsible-ai-toolbox

Make sure you don't import it from the interpret-community package, as it has been moved to the other repository.

Also, can you run:

pip show raiwidgets

to check that you have the latest version of the raiwidgets package with ExplanationDashboard?
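The same check can be done from inside the notebook with the standard library, which confirms the version seen by the kernel that is actually running. This is a small sketch; `installed_version` is a hypothetical helper name:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print("raiwidgets:", installed_version("raiwidgets"))
```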

@yzheng27
Author

yzheng27 commented Feb 4, 2022

@imatiach-msft I'm using the library from raiwidgets, and the version is 0.15.1.

I was able to get the dashboard with the data dimensions I mentioned, though it took several hours. Will try with the smaller data.

@imatiach-msft
Collaborator

imatiach-msft commented Feb 4, 2022

@yzheng27 if it took several hours but eventually worked, then the UI must simply have loaded too much data, and downsampling should speed it up significantly. All of the datapoints are loaded into the UI, and I've noticed that the UI usually becomes very slow beyond about 5k datapoints. Perhaps in the future the UI could be changed to stream selected data from the Python backend, or to aggregate statistics across multiple points, for users who want to run it on a lot of data; I'm not sure. The ErrorAnalysisDashboard is actually able to work on millions of points if you pass in a sample_dataset for the Dataset Explorer, so perhaps something like that could be done for the ExplanationDashboard as well:

https://github.com/microsoft/responsible-ai-toolbox/blob/main/raiwidgets/tests/test_error_analysis_dashboard.py#L83
