
RHOAIENG-2244: Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile #254

Draft · wants to merge 4 commits into base: main
Conversation

@LaVLaS (Member) commented Apr 23, 2024

Description

  • Updates the fetch-model-s3 task to provide basic support for fetching a model from a subdirectory of the S3 bucket, removing the hardcoded requirement that all model directories be stored at the bucket root.
  • Adds a universal OpenVino Containerfile that can build an inference container image for any supported OpenVino model format.

This allows users to follow the OpenShift AI Fraud Detection tutorial within this workflow to train and deploy a model using our pipelines.
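
As a rough, purely illustrative sketch (not the actual task script) of what fetching a model from a bucket subdirectory could look like, assuming an AWS-CLI-style client; the bucket and path come from the testing example below, the rest are placeholders:

    # Illustrative only -- the real fetch-model-s3 task may use a different client and script.
    # The point is that the model can now live under a subdirectory instead of the bucket root.
    S3_BUCKET=rhoai-edge                                  # bucket from the testing example
    MODEL_RELATIVE_PATH=example-models/fraud-detection    # subdirectory under the bucket
    MODEL_NAME=fraud-detection

    aws s3 sync "s3://${S3_BUCKET}/${MODEL_RELATIVE_PATH}/" "./model_dir/${MODEL_NAME}/"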

Jira: RHOAIENG-2244

How Has This Been Tested?

  • Followed the RHOAI Fraud Detection example to train a model, and used that model with MinIO and our pipeline
    OR
  • Fetch the fraud-detection model from s3://rhoai-edge/example-models/fraud-detection

The test-model-rest-svc task must be updated to match output based on the data: field rather than prediction: (a rough sketch of such a check follows below).

The Containerfile.openvino must be used to build the inferencing container for this model.
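
For illustration only, a minimal sketch of the kind of response check the updated test-model-rest-svc might perform; the host, endpoint, and input file are placeholders, and the exact field layout depends on the model server:

    # Hypothetical check: the inference response is matched on "data" rather than "prediction".
    # MODEL_SVC_HOST and sample-input.json are placeholders for this sketch.
    curl -s -X POST "http://${MODEL_SVC_HOST}/v2/models/fraud-detection/infer" \
      -H 'Content-Type: application/json' \
      -d @sample-input.json \
      | grep '"data"'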

Merge criteria:

  • Existing pipeline workflows for bike-rental and tensorflow-housing are not affected.

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lavlas. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MarianMacik self-requested a review on April 23, 2024 at 13:32
@adelton (Contributor) commented Apr 23, 2024:

What is this for? It does not seem to be used in any of the PipelineRuns and hence not tested by our CI. Does it replace pipelines/containerfiles/Containerfile.openvino.mlserver.mlflow?

Member (Author):

It does not replace the Containerfile you referenced, because that file is hardcoded to the tensorflow-housing model we use as an example. I would need to refactor the tensorflow-housing model away from MLFlow to align it with Containerfile.openvino.

I included Containerfile.openvino in the PR since it is what I needed to make this work with the Fraud Detection tutorial, but it is generic for any OpenVino model being packaged into an inference container. A key feature the new Containerfile supports is model versioning: if the model contains multiple versions under /model/<model name>/{1,2,3,...,999}, OpenVino will automatically serve the highest version found for inferencing (see the sketch below).
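
A minimal sketch of that versioned layout; the model name and version numbers here are just examples, not part of the PR:

    # Illustrative only: OpenVino expects /model/<model name>/<numeric version>/...
    # When several version directories exist, the highest-numbered one is served.
    for v in 1 2 3; do
      mkdir -p "/model/fraud-detection/${v}"
    done
    # With this layout, OpenVino Model Server would serve version 3 for inferencing.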

Contributor:

So why can't this be used for the tensorflow-housing instead of the original one, so that it gets exposed and tested, rather than just being a "random" file in the repo?

Member (Author):

So why can't this be used for the tensorflow-housing instead of the original one,....

I added the comment in the Containerfile.openvino.mlserver.mlflow file to address that issue. The tensorflow-housing models were developed for an MLFlow workflow, so I would need to update the directory structure for tensorflow-housing in Git|S3 to include the versioning structure required by OpenVino. Since the original model training code was not written for that workflow, my strong preference is to update the training code to support that structure, for reproducibility.

so that it gets exposed and tested, rather than just being a "random" file in the repo?

This would be a follow-up PR to bring the Fraud Detection model workflow into our supported workflow. I included it in this PR because it is required for building our OpenVino inference containers with any number of model versions, AND it allows anyone reviewing & testing the PR to use this Containerfile to verify OpenVino functionality.

Contributor:

Since the documentation to do that is not there, I doubt the utility of this at this point.

@@ -29,6 +29,9 @@ This repo currently contains two examples:
| **Model (Pipelines)** | `tensorflow-housing` | `bike-rentals-auto-ml` |
| **App (ACM)** | `tensorflow-housing-app` | `bike-rental-app` |


*Additionally*, you can follow the [OpenShift AI tutorial - Fraud detection example](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.8/html/openshift_ai_tutorial_-_fraud_detection_example/index) to execute the end to end workflow of training a model from scratch & building the [inferencing container](pipelines/containerfiles/Containerfile.openvino) to deliver to Edge environments.
Contributor:

If we are going to promote the fraud detection tutorial (which I like), shouldn't we just expand all the relevant places and PipelineRuns to actually lead the reader through the steps, and also have things tested by the CI?

Member (Author):

Yes, but that is out of scope for this PR. A lot of the current pipeline workflows are specific to the initial demo use case for the bike-rental and tensorflow-housing models, to show potential integration with AzureML and MLFlow. We need to refactor some of the tasks and pipelines to work with any model supported by OpenVino in ODH.

Contributor:

This is a maintenance liability.

Is it critical that this goes in in this PR? If not, I'd just leave this change out.

Member (Author):

Is it critical that this goes in in this PR? If not, I'd just leave this change out.

Which change, this text in the DEVELOPERS doc? I don't have any particular need to include this in the document other than to point people to a documented, supported workflow that fits our E2E scenario of training and delivering models to the edge.

This is a maintenance liability.

I am curious about this. Is it a maintenance liability because it is based on code outside of this repository?

Member:

I get that writing here that a user can follow that tutorial and use our pipeline for the build might mean that it stops working in the future, since we don't test it permanently. However, we can at least say that it is not permanently tested by us and is best effort on our side.

Member (Author):

These are all good points. I am going to separate all of the "Fraud Detection Model"-specific changes (this DEVELOPERS.md change and the Containerfile) into a separate story/work item, since some of the long-term fixes should be aligned with a complete rewrite of the pipeline, tasks, and manifests.

I get that writing here that a user can follow that tutorial and use our pipeline for the build might mean that it stops working in the future, since we don't test it permanently.

IMO, this scenario does have the benefit of us dogfooding our own tutorials, since this tutorial and the official documentation are maintained by another group working on OpenShift AI. If we find a bug, it would be good to file it early in the upstream releases.

Contributor:

Just pointing people somewhere with some promise of experiencing the end-to-end example across two repos, without including the relevant steps in our documentation, will lead people (including myself) to get stuck with "OK, what next?" questions, either because the steps are not trivial or because the other repo has changed significantly so our assumptions in the Containerfile no longer work.

Folks will either abandon the attempt, or they will file issues here (better) or in https://github.com/rh-aiservices-bu/fraud-detection (worse, since the maintainers of that repo don't know anything about this work). Even issues filed against this repo will be hard to respond to in a timely and effective manner if we don't have experience with that other repo among the team and in the CI.

@LaVLaS changed the title from "Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile" to "RHOAIENG-2244: Add support for fetching models from S3 based on the modelRelativePath & Adds a universal OpenVino ContainerFile" on Apr 24, 2024
openshift-ci-robot commented Apr 24, 2024

@LaVLaS: This pull request references RHOAIENG-2244 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to this:

(The PR title and description are quoted here verbatim; see the top of this page.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -319,7 +322,7 @@ spec:
if [ "$(params.upon-end)" == "stop" ]; then
oc scale deployment.apps/$(params.model-name)-$(params.model-version) --replicas=0
elif [ "$(params.upon-end)" == "delete" ]; then
oc delete all --selector=app=$(params.model-name)-$(params.model-version)
oc delete deployment,service --selector=app=$(params.model-name)-$(params.model-version)
Member:

Why was this change needed?

@@ -255,6 +257,7 @@ spec:
containers:
- image: $(params.candidate-image-tag-reference)
name: $(params.model-name)-$(params.model-version)
imagePullPolicy: Always
Member:

Is this changed to work in case we push the same label twice?

Member (Author):

Correct. I was stuck in a debugging loop: updates to the Containerfile were not propagating to the actual test-deployment pod because a stale directory was present in the running pod, even though the build step had renamed the folder. Ultimately it appeared that the image hash was from a much older build, so the quickest fix was to set it to Always.

Another option would be to use the digest hash from the previous build for these test pods, to ensure we are testing the exact build (a rough sketch of that option follows this thread).

Member:

Yeah, the digest hash might be a bulletproof option; we can fix it in a separate PR, maybe.
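
As a rough sketch of the digest-based option discussed above (illustrative only; the repository, tag, and variable names are assumptions, not part of this PR):

    # Illustrative only: pin the test deployment to the digest of the image that was just
    # built, instead of relying on imagePullPolicy: Always to re-pull a mutable tag.
    IMAGE_REPO=quay.io/example/fraud-detection       # placeholder repository
    IMAGE_TAG=candidate                              # placeholder tag produced by the build step
    DIGEST=$(skopeo inspect "docker://${IMAGE_REPO}:${IMAGE_TAG}" | jq -r '.Digest')
    # Container name follows the deployment's $(params.model-name)-$(params.model-version) pattern.
    oc set image "deployment/${MODEL_NAME}-${MODEL_VERSION}" \
      "${MODEL_NAME}-${MODEL_VERSION}=${IMAGE_REPO}@${DIGEST}"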

@@ -1,3 +1,7 @@
# This ContainerFile is used to manually convert the MLFlow tensorflow-housing example to a format that works with the OpenVino Model Server
# The tensorflow-housing example we provide in the ai-edge repo copies the tensorflow-housing/tf2model/{saved_model.pb,variables/} directory to a
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
Member:

Suggested change (the removed and added lines render identically here; the suggestion appears to change only whitespace):
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
# format that is compatible with OpenVino Model Server: /model/<MODEL VERSION>/<TENSORFLOW SAVED MODEL FORMAT>
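
For context on the conversion that the comment at the top of this hunk describes, a rough shell illustration; the paths come from that comment, while the version number is just an example and this is not the actual Containerfile content:

    # Illustrative only: copy the MLFlow-produced TensorFlow SavedModel into the
    # versioned layout that OpenVino Model Server expects (/model/<MODEL VERSION>/...).
    MODEL_VERSION=1
    mkdir -p "/model/${MODEL_VERSION}"
    cp -r tensorflow-housing/tf2model/saved_model.pb \
          tensorflow-housing/tf2model/variables \
          "/model/${MODEL_VERSION}/"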


@LaVLaS marked this pull request as draft on April 26, 2024 at 05:14