Unexplained RAM exhaustion with ensemble VotingClassifier model #295

Open
apavlo89 opened this issue Feb 13, 2024 · 0 comments


apavlo89 commented Feb 13, 2024

Hello,

I am experiencing an issue with high RAM usage when running ExplainerDashboard on a VotingClassifier ensemble model. The model is trained on a dataset of about 2,000 rows with 400 features in total; however, each classifier in the VotingClassifier uses only a subset of this feature set. My goal is to understand how my algorithm makes predictions on a dataset for which I do not yet know the label outcomes. Despite the dataset to be explained being relatively small (28 rows), RAM usage spikes to over 51 GB. Keep in mind I am running this in Google Colab. I suspect this might be related to how ExplainerDashboard handles ensemble models, or to the computation of SHAP values for such a complex model, or it might just be a bug. Below is a simplified version of my setup:

Model Setup

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
# Other imports...

# Define pipelines for the individual models (illustrative steps only;
# my real pipelines also select a different feature subset per classifier)
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
xgb_pipeline = Pipeline([
    ('clf', XGBClassifier()),
])
# Other pipelines...

# Soft-voting ensemble: averages predict_proba across all pipelines
eclf = VotingClassifier(
    estimators=[
        ('lr', lr_pipeline),
        ('xgb', xgb_pipeline),
        # Other models...
    ],
    voting='soft',
)

eclf.fit(X_train, y_train)


from explainerdashboard import ClassifierExplainer, ExplainerDashboard
from pyngrok import ngrok  # ngrok was otherwise undefined in the snippet

# Initialize the Explainer without target labels
# (future_predict holds the 28 unlabeled rows to be explained)
explainer = ClassifierExplainer(eclf, future_predict, shap='kernel', model_output='probability')

# Create and run the dashboard, then tunnel it out of Colab
dashboard = ExplainerDashboard(explainer)
dashboard.run(port=8050)
ngrok_tunnel = ngrok.connect(8050)
print('Public URL:', ngrok_tunnel.public_url)

This is the output:

WARNING: For shap='kernel', shap interaction values can unfortunately not be calculated!
Note: shap values for shap='kernel' normally get calculated against X_background, but paramater X_background=None, so setting X_background=shap.sample(X, 50)...
Generating self.shap_explainer = shap.KernelExplainer(model, X, link='identity')
Building ExplainerDashboard..
Detected google colab environment, setting mode='external'
No y labels were passed to the Explainer, so setting model_summary=False...
For this type of model and model_output interactions don't work, so setting shap_interaction=False...
The explainer object has no decision_trees property. so setting decision_trees=False...
Generating layout...
Calculating shap values...
/usr/local/lib/python3.10/dist-packages/dash/dash.py:538: UserWarning:

JupyterDash is deprecated, use Dash instead.
See https://dash.plotly.com/dash-in-jupyter for more details.

Within a few seconds of running the code, RAM usage exceeds what is available and the session crashes.
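For what it's worth, the only mitigation I can think of is shrinking the background set that KernelExplainer samples against. A sketch of what I mean (X_background is the parameter named in the log above; the size of 5 is an arbitrary guess, not something I have verified fixes the crash):

import shap

# Hypothetical mitigation: pass a much smaller background set explicitly
# instead of letting ClassifierExplainer default to shap.sample(X, 50).
explainer = ClassifierExplainer(
    eclf,
    future_predict,
    shap='kernel',
    model_output='probability',
    X_background=shap.sample(X_train, 5),  # 5 background rows instead of 50
)

Even if that works, it would only shrink the computation rather than explain why the dashboard needs 51+ GB for 28 rows, so I would still appreciate any insight into what is being allocated here.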
