Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support files for pyfunc models and model_config #11951

Merged
merged 37 commits into from May 17, 2024

Conversation

annzhang-db
Copy link
Collaborator

@annzhang-db annzhang-db commented May 9, 2024

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11951/merge

Checkout with GitHub CLI

gh pr checkout 11951

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

  • Move shared utils for langchain and pyfunc to models/utils
  • Support logging and loading code from files for pyfunc models
    • Introduce code model loader module
  • Support yaml files for model_config in pyfunc models

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Copy link

github-actions bot commented May 9, 2024

Documentation preview for afb8b84 will be available when this CircleCI job
completes successfully.

More info

@github-actions github-actions bot added the rn/none List under Small Changes in Changelogs. label May 9, 2024
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
@bbqiu bbqiu mentioned this pull request May 9, 2024
39 tasks
Copy link
Collaborator

@sunishsheth2009 sunishsheth2009 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harupy can you help review the code as well? Specially pyfunc.load_model.

Other than that, it looks good to me.

mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@bbqiu bbqiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great - would it be possible for you to adjust the error message here as well?

"When `python_model` is a callable object, it must accept exactly one argument. "
f"Found {num_args} arguments.",

mlflow/langchain/__init__.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
@bbqiu
Copy link
Collaborator

bbqiu commented May 9, 2024

could you also fill in the testing section w/ any DB notebooks you tested the loading behavior of paths that lead to db notebooks in?

@annzhang-db annzhang-db requested a review from harupy May 10, 2024 08:38
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
if python_model:
model_code_path = None
if isinstance(python_model, str):
model_code_path = _validate_and_get_model_code_path(python_model)
Copy link
Member

@harupy harupy May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if the python_model file depends on other custom modules or external resources? Do they need to be logged as artifacts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly sure how that would work - can we address this in a follow-up?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if the python_model file depends on other custom modules or external resources? Do they need to be logged as artifacts?

@harupy I think those custom modules and resources can be passed in as pip_requirements or code_paths and those should still work here right? Or am I missing something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annzhang-db @sunishsheth2009 can we write a test for that case, if we assume it will work? If it doesn't work, we need to either fix it or update documentation to indicate that it isn't supported

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote a test for this - code_paths works with the functionality 😄

mlflow/utils/model_utils.py Outdated Show resolved Hide resolved
mlflow/utils/model_utils.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments that should simplify and derisk these changes by removing lots of if / else logic.

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
mlflow/pyfunc/model.py Outdated Show resolved Hide resolved
mlflow/pyfunc/__init__.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @annzhang-db ! Left some comments - let me know if you have questions!

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to this file are all for clarity, no functionality changes.

lc_model = _load_model_code_path(model_code_path, model_config)
_validate_and_copy_file_to_directory(model_code_path, path, "code")
else:
lc_model = lc_model_or_path
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this point, lc_model is always a langchain model instance - it cannot be a path.

@@ -252,49 +252,32 @@ def load_retriever(persist_directory):
import langchain
from langchain.schema import BaseRetriever

lc_model = _validate_and_wrap_lc_model(lc_model, loader_fn)
lc_model_or_path = _validate_and_prepare_lc_model_or_path(lc_model, loader_fn)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use variable name lc_model_or_path while both a langchain model or file path are possible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor - move shared functions from langchain/utils.py to models/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor - moved functions verbatim from langchain/utils.py here to be used for pyfunc as well. No code changes.

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
@@ -900,49 +882,6 @@ def load_model(model_uri, dst_path=None):
return _load_model_from_local_fs(local_model_path)


@contextmanager
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved code to models/utils.py

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved functions to models/utils.py.

@@ -2328,8 +2363,6 @@ def predict(model_input: List[str]) -> List[str]:
)
raise MlflowException(message=msg, error_code=INVALID_PARAMETER_VALUE)

_validate_and_prepare_target_save_path(path)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved up so we can copy the model code file to the save path.

Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
@dbczumar dbczumar requested a review from ian-ack-db May 17, 2024 01:33
@annzhang-db annzhang-db requested a review from dbczumar May 17, 2024 04:40
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Signed-off-by: Ann Zhang <ann.zhang@databricks.com>
Comment on lines +925 to +933
from mlflow.langchain import _validate_and_prepare_lc_model_or_path

# If its not a PyFuncModel, then it should be a Langchain model
_validate_and_wrap_lc_model(model, None)
# If its not a PyFuncModel, then it should be a Langchain model (not a path)
# Check this since the validation function does not
if isinstance(model, str):
raise mlflow.MlflowException(
"Model should either be an instance of PyFuncModel or Langchain type."
)
model = _validate_and_prepare_lc_model_or_path(model, None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annzhang-db @sunishsheth2009 I recall that we talked about making set_model support a Python function too. This is also supported by mlflow.pyfunc.log_model(). Can we support this if it's easy or file a follow-up internal ticket?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we have a ticket for that option as well. We will have to do it as a follow up. :)

@@ -839,7 +820,8 @@ def predict(
return result


def _load_pyfunc(path):
# TODO: Support loading langchain with model_config. For now, this is a no-op.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annzhang-db Do we have an internal ticket to track this? It would be very useful

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a ticket!

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annzhang-db LGTM once remaining comments are addressed. Tried this out manually with a few variations on the following scripts; it works quite well:

  • script.py
import mlflow
from mlflow.models import set_model
from mlflow.models import ModelConfig

config = ModelConfig()

class Model(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params):
        print("CONFIG", config.config)
        print("CONFIG", config.get("A"))
        print("PARAMS", params)
        return model_input.apply(lambda column: column + 1)

set_model(Model())
  • logger.py
import mlflow
from mlflow.models import infer_signature



import pandas

df = pandas.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "c"])

sig = infer_signature(df, params={"BAZ": "BAR"})

model_info = mlflow.pyfunc.log_model(
    python_model="script.py",
    artifact_path="baz",
    signature=sig,
    model_config={"A": "B"},
)


print(model_info.model_uri)

mod = mlflow.pyfunc.load_model(model_info.model_uri, model_config={"A": "B"})
print(mod)
print(mod.predict(df, params={"BAZ": "BAR"}))

@annzhang-db annzhang-db merged commit 9eeaef9 into mlflow:master May 17, 2024
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rn/none List under Small Changes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants