
RHOAIENG-2244: Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile #254

Draft · wants to merge 4 commits into base: main
Conversation

@LaVLaS (Member) commented Apr 23, 2024

Description

  • Updates the fetch-model-s3 task to provide basic support for fetching a model from a subdirectory of the S3 bucket, removing the hardcoded requirement that all model directories be stored at the bucket root.
  • Adds a universal OpenVino Containerfile that can build an inference container image for any supported OpenVino model format.

This allows users to follow the OpenShift AI Fraud Detection tutorial within this workflow to train and deploy a model using our pipelines.
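
As a rough, purely illustrative sketch (not the actual task script) of what fetching a model from a bucket subdirectory could look like, assuming an AWS-CLI-style client; the bucket and path come from the testing example below, the rest are placeholders:

    # Illustrative only -- the real fetch-model-s3 task may use a different client and script.
    # The point is that the model can now live under a subdirectory instead of the bucket root.
    S3_BUCKET=rhoai-edge                                  # bucket from the testing example
    MODEL_RELATIVE_PATH=example-models/fraud-detection    # subdirectory under the bucket
    MODEL_NAME=fraud-detection

    aws s3 sync "s3://${S3_BUCKET}/${MODEL_RELATIVE_PATH}/" "./model_dir/${MODEL_NAME}/"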

Jira: RHOAIENG-2244

How Has This Been Tested?

  • Followed the RHOAI Fraud Detection example to train a model, and used that model with MinIO and our pipeline
    OR
  • Fetch the fraud-detection model from s3://rhoai-edge/example-models/fraud-detection

The test-model-rest-svc task must be updated to match output based on the data: field rather than prediction: (a rough sketch of such a check follows below).

The Containerfile.openvino must be used to build the inferencing container for this model.
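
For illustration only, a minimal sketch of the kind of response check the updated test-model-rest-svc might perform; the host, endpoint, and input file are placeholders, and the exact field layout depends on the model server:

    # Hypothetical check: the inference response is matched on "data" rather than "prediction".
    # MODEL_SVC_HOST and sample-input.json are placeholders for this sketch.
    curl -s -X POST "http://${MODEL_SVC_HOST}/v2/models/fraud-detection/infer" \
      -H 'Content-Type: application/json' \
      -d @sample-input.json \
      | grep '"data"'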

Merge criteria:

  • Existing pipeline workflows for bike-rental and tensorflow-housing are not affected.

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lavlas. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MarianMacik self-requested a review on April 23, 2024 at 13:32
@adelton (Contributor) commented Apr 23, 2024:

What is this for? It does not seem to be used in any of the PipelineRuns and hence not tested by our CI. Does it replace pipelines/containerfiles/Containerfile.openvino.mlserver.mlflow?

Member (Author):

It does not replace the Containerfile you referenced, because that file is hardcoded to the tensorflow-housing model we use as an example. I would need to refactor the tensorflow-housing model away from MLFlow to align it with Containerfile.openvino.

I included Containerfile.openvino in the PR since it is what I needed to make this work with the Fraud Detection tutorial, but it is generic for any OpenVino model being packaged into an inference container. A key feature the new Containerfile supports is model versioning: if the model contains multiple versions under /model/<model name>/{1,2,3,...,999}, OpenVino will automatically serve the highest version found for inferencing (see the sketch below).
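
A minimal sketch of that versioned layout; the model name and version numbers here are just examples, not part of the PR:

    # Illustrative only: OpenVino expects /model/<model name>/<numeric version>/...
    # When several version directories exist, the highest-numbered one is served.
    for v in 1 2 3; do
      mkdir -p "/model/fraud-detection/${v}"
    done
    # With this layout, OpenVino Model Server would serve version 3 for inferencing.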

Contributor:

So why can't this be used for the tensorflow-housing instead of the original one, so that it gets exposed and tested, rather than just being a "random" file in the repo?

Member (Author):

So why can't this be used for the tensorflow-housing instead of the original one,....

I added the comment in the Containerfile.openvino.mlserver.mlflow file to address that issue. The tensorflow-housing models were developed for an MLFlow workflow, so I would need to update the directory structure for tensorflow-housing in Git|S3 to include the versioning structure required by OpenVino. Since the original model training code was not written for that workflow, my strong preference is to update the training code to support that structure, for reproducibility.

so that it gets exposed and tested, rather than just being a "random" file in the repo?

This would be a follow-up PR to bring the Fraud Detection model workflow into our supported workflow. I included it in this PR because it is required for building our OpenVino inference containers with any number of model versions, AND it allows anyone reviewing & testing the PR to use this Containerfile to verify OpenVino functionality.

Contributor:

Since the documentation to do that is not there, I doubt the utility of this at this point.

@@ -29,6 +29,9 @@ This repo currently contains two examples:
| **Model (Pipelines)** | `tensorflow-housing` | `bike-rentals-auto-ml` |
| **App (ACM)** | `tensorflow-housing-app` | `bike-rental-app` |


*Additionally*, you can follow the [OpenShift AI tutorial - Fraud detection example](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.8/html/openshift_ai_tutorial_-_fraud_detection_example/index) to execute the end to end workflow of training a model from scratch & building the [inferencing container](pipelines/containerfiles/Containerfile.openvino) to deliver to Edge environments.
Contributor:

If we are going to promote the fraud detection tutorial (which I like), shouldn't we just expand all the relevant places and PipelineRuns to actually lead the reader through the steps, and also have things tested by the CI?

Member (Author):

Yes, but that is out of scope for this PR. A lot of the current pipeline workflows are specific to the initial demo use case for the bike-rental and tensorflow-housing models, to show potential integration with AzureML and MLFlow. We need to refactor some of the tasks and pipelines to work with any model supported by OpenVino in ODH.

Contributor:

This is a maintenance liability.

Is it critical that this goes in in this PR? If not, I'd just leave this change out.

Member (Author):

Is it critical that this goes in in this PR? If not, I'd just leave this change out.

Which change, this text in the DEVELOPERS doc? I don't have any particular need to include this in the document other than to point people to a documented, supported workflow that fits our E2E scenario of training and delivering models to the edge.

This is a maintenance liability.

I am curious about this. Is it a maintenance liability because it is based on code outside of this repository?

Member:

I get that writing here that a user can follow that tutorial and use our pipeline for the build might mean that it stops working in the future, since we don't test it permanently. However, we can at least say that it is not permanently tested by us and is best effort on our side.

Member (Author):

These are all good points. I am going to separate all of the "Fraud Detection Model"-specific changes (this DEVELOPERS.md change and the Containerfile) into a separate story/work item, since some of the long-term fixes should be aligned with a complete rewrite of the pipeline, tasks, and manifests.

I get that writing here that a user can follow that tutorial and use our pipeline for the build might mean that it stops working in the future, since we don't test it permanently.

IMO, this scenario does have the benefit of us dogfooding our own tutorials, since this tutorial and the official documentation are maintained by another group working on OpenShift AI. If we find a bug, it would be good to file it early in the upstream releases.

Contributor:

Just pointing people somewhere with some promise of experiencing the end-to-end example across two repos, without including the relevant steps in our documentation, will lead people (including myself) to get stuck with "OK, what next?" questions, either because the steps are not trivial or because the other repo has changed significantly so our assumptions in the Containerfile no longer work.

Folks will either abandon the attempt, or they will file issues here (better) or in https://github.com/rh-aiservices-bu/fraud-detection (worse, since the maintainers of that repo don't know anything about this work). Even issues filed against this repo will be hard to respond to in a timely and effective manner if we don't have experience with that other repo among the team and in the CI.

@LaVLaS changed the title from "Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile" to "RHOAIENG-2244: Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile" on Apr 24, 2024
openshift-ci-robot commented Apr 24, 2024

@LaVLaS: This pull request references RHOAIENG-2244 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

(The PR title and description are quoted here verbatim; see the top of this page.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -319,7 +322,7 @@ spec:
if [ "$(params.upon-end)" == "stop" ]; then
oc scale deployment.apps/$(params.model-name)-$(params.model-version) --replicas=0
elif [ "$(params.upon-end)" == "delete" ]; then
oc delete all --selector=app=$(params.model-name)-$(params.model-version)
oc delete deployment,service --selector=app=$(params.model-name)-$(params.model-version)
Member:

Why was this change needed?

@@ -255,6 +257,7 @@ spec:
containers:
- image: $(params.candidate-image-tag-reference)
name: $(params.model-name)-$(params.model-version)
imagePullPolicy: Always
Member:

Is this changed to work in case we push the same label twice?

Member (Author):

Correct. I was stuck in a debugging loop: updates to the Containerfile were not propagating to the actual test-deployment pod because a stale directory was present in the running pod, even though the build step had renamed the folder. Ultimately it appeared that the image hash was from a much older build, so the quickest fix was to set it to Always.

Another option would be to use the digest hash from the previous build for these test pods, to ensure we are testing the exact build (a rough sketch of that option follows this thread).

Member:

Yeah, the digest hash might be a bulletproof option; we can fix it in a separate PR, maybe.
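
As a rough sketch of the digest-based option discussed above (illustrative only; the repository, tag, and variable names are assumptions, not part of this PR):

    # Illustrative only: pin the test deployment to the digest of the image that was just
    # built, instead of relying on imagePullPolicy: Always to re-pull a mutable tag.
    IMAGE_REPO=quay.io/example/fraud-detection       # placeholder repository
    IMAGE_TAG=candidate                              # placeholder tag produced by the build step
    DIGEST=$(skopeo inspect "docker://${IMAGE_REPO}:${IMAGE_TAG}" | jq -r '.Digest')
    # Container name follows the deployment's $(params.model-name)-$(params.model-version) pattern.
    oc set image "deployment/${MODEL_NAME}-${MODEL_VERSION}" \
      "${MODEL_NAME}-${MODEL_VERSION}=${IMAGE_REPO}@${DIGEST}"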

@@ -1,3 +1,7 @@
# This ContainerFile is used to manually convert the MLFlow tensorflow-housing example to a format that works with the OpenVino Model Server
# The tensorflow-housing example we provide in the ai-edge repo copies the tensorflow-housing/tf2model/{saved_model.pb,variables/} directory to a
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
Member:

Suggested change (the removed and added lines render identically here; the suggestion appears to change only whitespace):
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
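
For context on the conversion that the comment at the top of this hunk describes, a rough shell illustration; the paths come from that comment, while the version number is just an example and this is not the actual Containerfile content:

    # Illustrative only: copy the MLFlow-produced TensorFlow SavedModel into the
    # versioned layout that OpenVino Model Server expects (/model/<MODEL VERSION>/...).
    MODEL_VERSION=1
    mkdir -p "/model/${MODEL_VERSION}"
    cp -r tensorflow-housing/tf2model/saved_model.pb \
          tensorflow-housing/tf2model/variables \
          "/model/${MODEL_VERSION}/"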


@LaVLaS marked this pull request as draft on April 26, 2024 at 05:14