[BUG] mlflow gc do not remove deleted experiments when runs were tracked with datasets #10699

Closed
6 of 23 tasks
Mlokos opened this issue Dec 14, 2023 · 3 comments · Fixed by #11695 · May be fixed by #11966
Labels
area/deployments MLflow Deployments client APIs, server, and third-party Deployments integrations area/examples Example code area/server-infra MLflow Tracking server backend area/sqlalchemy Use of SQL alchemy in tracking service or model registry area/tracking Tracking service, tracking client APIs, autologging bug Something isn't working has-closing-pr This issue has a closing PR

Comments

@Mlokos

Mlokos commented Dec 14, 2023

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.9.2
  • Tracking server: 2.9.2

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.3 LTS
  • Python version: Python 3.10.12
  • yarn version, if running the dev UI: -

Describe the problem

The "mlflow gc" command does not remove deleted experiments when their runs were tracked with datasets.

Tracking information

It was a default setup from "Remote Experiment Tracking with MLflow Tracking Server" scenario:
https://mlflow.org/docs/latest/tracking/tutorials/remote-server.html

  1. Once the server was set up, I ran the sample code (attached below in the "Code to reproduce issue" section) to generate a "test" experiment.
  2. Then I deleted the experiment through the web UI.
  3. After that, I ran the following commands:
export MLFLOW_TRACKING_URI="http://localhost:5000"
mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"

Code to reproduce issue

import mlflow

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("test")
mlflow.autolog()

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

# Create and train models.
rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
rf.fit(X_train, y_train)

# Use the model to make predictions on the test dataset.
predictions = rf.predict(X_test)

Stack trace

mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"
Run with ID 6fb6bb680a2249388ec8346febc6540c has been permanently deleted.
Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.ForeignKeyViolation: update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 142, in make_managed_session
    session.commit()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1969, in commit
    trans.commit(_to_root=True)
  File "<string>", line 2, in commit
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1256, in commit
    self._prepare_impl()
  File "<string>", line 2, in _prepare_impl
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1231, in _prepare_impl
    self.session.flush()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4312, in flush
    self._flush(objects)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4447, in _flush
    with util.safe_reraise():
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 4408, in _flush
    flush_context.execute()
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
    rec.execute(self)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 679, in execute
    util.preloaded.orm_persistence.delete_obj(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 191, in delete_obj
    _emit_delete_statements(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1456, in _emit_delete_statements
    c = connection.execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
    return meth(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 516, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
    return self._exec_single_context(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
    self._handle_dbapi_exception(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2343, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".

[SQL: DELETE FROM experiments WHERE experiments.experiment_id = %(experiment_id)s]
[parameters: {'experiment_id': 1}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mlokos/dev/mlflow/venv/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/cli.py", line 630, in gc
    backend_store._hard_delete_experiment(experiment_id)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/tracking/sqlalchemy_store.py", line 421, in _hard_delete_experiment
    with self.ManagedSessionMaker() as session:
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/mlokos/dev/mlflow/venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 155, in make_managed_session
    raise MlflowException(message=e, error_code=BAD_REQUEST)
mlflow.exceptions.MlflowException: (psycopg2.errors.ForeignKeyViolation) update or delete on table "experiments" violates foreign key constraint "datasets_experiment_id_fkey" on table "datasets"
DETAIL:  Key (experiment_id)=(1) is still referenced from table "datasets".

[SQL: DELETE FROM experiments WHERE experiments.experiment_id = %(experiment_id)s]
[parameters: {'experiment_id': 1}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)
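
For readers who want to see the failure mode without a PostgreSQL server, here is a minimal sketch of the same kind of foreign-key violation using Python's built-in sqlite3 module. The two-table schema is a simplified assumption for illustration, not MLflow's real schema, and the helper names build_demo_db and try_hard_delete are hypothetical. Only the table names and the experiment_id and lifecycle_stage columns are taken from this thread.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE experiments ("
        " experiment_id INTEGER PRIMARY KEY,"
        " lifecycle_stage TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE datasets ("
        " dataset_uuid TEXT PRIMARY KEY,"
        " experiment_id INTEGER NOT NULL"
        " REFERENCES experiments(experiment_id))"
    )
    # One soft-deleted experiment that still has a dataset row pointing at it.
    conn.execute("INSERT INTO experiments VALUES (1, 'deleted')")
    conn.execute("INSERT INTO datasets VALUES ('d1', 1)")
    conn.commit()
    return conn


def try_hard_delete(conn: sqlite3.Connection, experiment_id: int) -> bool:
    """Try to delete the experiment row; return False on an integrity error."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "DELETE FROM experiments WHERE experiment_id = ?",
                (experiment_id,),
            )
    except sqlite3.IntegrityError:
        return False
    return True


if __name__ == "__main__":
    demo = build_demo_db()
    print("hard delete succeeded:", try_hard_delete(demo, 1))
```

In this sketch the delete is rejected for the same structural reason as in the trace above: a row in "datasets" still references the experiment being removed.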

Other info / logs

The issue is solely related to how the "datasets" table is handled.
I was able to make the "mlflow gc" command work, but I had to manually remove the rows from that table first.

psql -d mlflowdb -U user --password -h localhost -p 5432
TRUNCATE TABLE datasets; -- inside the psql CLI; note that this removes every row in the table

After that, the "mlflow gc" command was able to remove the data from PostgreSQL and the artifacts from MinIO.

export MLFLOW_TRACKING_URI="http://localhost:5000"
mlflow gc --backend-store-uri="postgresql://user:password@localhost:5432/mlflowdb"
Experiment with ID 1 has been permanently deleted.
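
Since TRUNCATE clears the dataset rows of every experiment, including active ones, a narrower variant of the same workaround is to remove only the rows that belong to the experiment being garbage-collected. Below is a rough sketch of that idea against an in-memory SQLite database. The schema is a simplified assumption rather than MLflow's real one, and make_db and purge_datasets_for are hypothetical helper names. The final DELETE on "experiments" merely stands in for what "mlflow gc" would do.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory database with an assumed, simplified two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK in SQLite
    conn.executescript(
        """
        CREATE TABLE experiments (
            experiment_id INTEGER PRIMARY KEY,
            lifecycle_stage TEXT NOT NULL
        );
        CREATE TABLE datasets (
            dataset_uuid TEXT PRIMARY KEY,
            experiment_id INTEGER NOT NULL
                REFERENCES experiments(experiment_id)
        );
        INSERT INTO experiments VALUES (1, 'deleted'), (2, 'active');
        INSERT INTO datasets VALUES ('d1', 1), ('d2', 2);
        """
    )
    return conn


def purge_datasets_for(conn: sqlite3.Connection, experiment_id: int) -> int:
    """Delete dataset rows of one experiment; return how many were removed."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM datasets WHERE experiment_id = ?",
            (experiment_id,),
        )
    return cur.rowcount


if __name__ == "__main__":
    db = make_db()
    print("dataset rows removed:", purge_datasets_for(db, 1))
    # Stand-in for the hard delete that "mlflow gc" performs afterwards.
    with db:
        db.execute("DELETE FROM experiments WHERE experiment_id = 1")
    print("remaining datasets:", db.execute("SELECT * FROM datasets").fetchall())
```

The dataset row of the active experiment survives, which is the main difference from truncating the whole table.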

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations
@daniellok-db
Collaborator

Thanks for the report! I can reproduce this error, will investigate and get back to you!

@vbousson

Hello folks. I reproduced the same bug today on the same remote tracking server setup, but with a MySQL database and GCS bucket storage.

A possible workaround, which I applied on my system this morning, is to partially clear the "datasets" table before calling mlflow gc, using the following SQL command:
DELETE datasets FROM datasets JOIN experiments ON datasets.experiment_id = experiments.experiment_id WHERE experiments.lifecycle_stage = 'deleted';
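
The multi-table DELETE ... JOIN form above is MySQL syntax, so it will not run as written on PostgreSQL, which the original report used. The same cleanup can be written with a standard subquery. The sketch below exercises that subquery form on an in-memory SQLite database. The schema is a simplified assumption, and seed_db and purge_datasets_of_deleted_experiments are hypothetical names. Test any such statement on a backup before running it against a real tracking database.

```python
import sqlite3

# Subquery form of the cleanup; uses only standard SQL constructs.
PURGE_SQL = (
    "DELETE FROM datasets "
    "WHERE experiment_id IN ("
    " SELECT experiment_id FROM experiments"
    " WHERE lifecycle_stage = 'deleted')"
)


def seed_db() -> sqlite3.Connection:
    """In-memory database with an assumed, simplified schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK in SQLite
    conn.executescript(
        """
        CREATE TABLE experiments (
            experiment_id INTEGER PRIMARY KEY,
            lifecycle_stage TEXT NOT NULL
        );
        CREATE TABLE datasets (
            dataset_uuid TEXT PRIMARY KEY,
            experiment_id INTEGER NOT NULL
                REFERENCES experiments(experiment_id)
        );
        INSERT INTO experiments VALUES
            (1, 'deleted'), (2, 'active'), (3, 'deleted');
        INSERT INTO datasets VALUES
            ('d1', 1), ('d2', 2), ('d3', 3), ('d4', 3);
        """
    )
    return conn


def purge_datasets_of_deleted_experiments(conn: sqlite3.Connection) -> int:
    """Remove dataset rows of soft-deleted experiments; return the row count."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(PURGE_SQL)
    return cur.rowcount


if __name__ == "__main__":
    db = seed_db()
    print("rows removed:", purge_datasets_of_deleted_experiments(db))
    print("rows kept:", db.execute("SELECT * FROM datasets").fetchall())
```

Only dataset rows tied to experiments whose lifecycle_stage is 'deleted' are removed, so active experiments keep their dataset records.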


@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
