ClearML serving design v2 #17

Open
bmartinn opened this issue Jan 29, 2022 · 8 comments
bmartinn commented Jan 29, 2022

ClearML serving design document v2.0

Goal: Create a simple interface to serve multiple models with scalable serving engines on top of Kubernetes

Design Diagram (image attached to the original issue)

Features

  • Fully continuous model upgrade/configuration capabilities
  • Separate pre/post processing from model inference (serving engine)
  • Support custom python script per endpoint (pre/post processing)
  • Support multiple model inference serving engine instances
  • Support A/B/Canary testing per Endpoint (i.e. test new versions of the model with a probability distribution); see the routing sketch after this list
  • Support model monitoring functions
  • Support for 3rd party monitoring plugins
  • Abstract Serving Engine interface
  • REST API with serving engine
  • gRPC interface between pre-processing python code and model inference
    • More efficient encoding than json encode/decode (both compute and network)
  • Performance (i.e. global latency / throughput and model inference latency) logging
    • Optional custom metric reporting
  • Standalone setup for debugging
    • Pre-processing (proxy) code running on the host machine, launching the “Model Inference”
    • Model inference (serving engine) inside local container
  • Deployment support for Kubernetes
    • Proxy container (with pre-processing code) has kubectl control
    • Serving engine container (model inference) launched by the proxy container
  • Autoscaling inference model engines based on latency
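
To make the A/B/Canary idea concrete, here is a minimal routing sketch (illustration only; the endpoint name, model versions and traffic weights below are made up and not part of the design):

```python
# Minimal sketch of per-endpoint A/B/Canary routing (illustration only).
# The endpoint name, model versions and traffic weights are hypothetical.
import random
from collections import Counter

# Probability distribution over model versions serving a single endpoint
CANARY_CONFIG = {
    "classifier": [("classifier_v1", 0.90), ("classifier_v2_canary", 0.10)],
}

def pick_model(endpoint: str) -> str:
    """Pick the model version that should handle this request."""
    versions, weights = zip(*CANARY_CONFIG[endpoint])
    return random.choices(versions, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Roughly 90% of requests should land on v1, 10% on the canary
    print(Counter(pick_model("classifier") for _ in range(1000)))
```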

Modules

  • ClearML serving container
    • Singleton instance, acting as the proxy & load balancer
  • ClearML serving Task
    • Stores configuration of a single instance of the Serving container
      • 3rd party plugins
      • Kubernetes config
      • Serving Engine configuration
      • Models / Endpoints
  • Serving Engine
    • Standalone container interacting with the ClearML serving instance
    • ClearML Sidecar configuring the Serving Engine (real-time) & sending reports back
  • ClearML model repository
    • Unique ID per model
    • Links to model files
    • Links to the model's pre/post processing code base (git)
    • Supports Tags / Name
    • General purpose key/value meta-data
    • Queryable (see the query sketch after this list)
  • Configuration CLI
    • Build containers
    • Configure serving system

Usage Example

  • CLI configuring the ClearML serving Task
    • Select initial set of models / endpoints (i.e. endpoint for specific model)
    • Set Kubernetes pod template YAML
      • Job YAML to be used for launching the serving engine container
  • CLI build Kubernetes Job YAML
    • Build the Kubernetes Job YAML to be used to launch the ClearML serving container
    • Add the necessary credentials, making sure the “ClearML serving container” will be able to launch serving engine containers
  • Kubectl launching the “ClearML serving container”
    • The “ClearML serving container” will be launching the serving engine containers
  • Once the “ClearML serving container” is up, its logs can be monitored in the ClearML UI
  • Add additional models to a running “ClearML serving container”
    • Provide the “ClearML serving Task”
    • Add/remove model UIDs (see the sketch after this list)
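
A rough sketch of what adding/removing models on a running instance could look like through the “ClearML serving Task” (the configuration section name and its JSON layout are assumptions for illustration, not the final design):

```python
# Sketch: add/remove model endpoints by updating the "ClearML serving Task".
# The configuration object name ("endpoints") and its JSON layout are assumptions.
import json
from clearml import Task

serving_task = Task.get_task(task_id="<clearml-serving-task-id>")

# Read the endpoint configuration currently stored on the Task
raw = serving_task.get_configuration_object("endpoints") or "{}"
endpoints = json.loads(raw)

# Add a new model UID and remove an old endpoint
endpoints["new_endpoint"] = {"model_id": "<model-uid>"}
endpoints.pop("old_endpoint", None)

# Write it back; the running serving container picks up the change on its next poll
serving_task.set_configuration_object("endpoints", json.dumps(endpoints, indent=2))
```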

okyspace commented Feb 5, 2022

@bmartinn nice work.

  1. Can I ask if there is a timeline for rolling this out progressively, or any branches I can access while it is being built?

  2. From the description, it seems the design routes inference requests through clearml-serving too, which is why you added deployment strategies and preprocessing in clearml-serving. What's the thinking behind routing inference requests through clearml-serving? Are you thinking of allowing 3rd-party extensions of clearml-serving to cover all aspects of model serving, such as pre/post processing, model validation, and resource monitoring?

  3. I did some of this work too, so I'm wondering how I can gel with the parallel work done by the ClearML team.

  • A/B/Canary/Mirrored. I set this up using a Kubernetes ingress with the NGINX controller to route/mirror inference requests to another set of Triton instances. Simple configs, which I can share if you find them useful.
  • clearml-sidecar.
    • I did part of the clearml-sidecar; I embedded it in the serving engine container and spun up a subprocess that monitors changes in the clearml-serving state and updates the published models periodically.
    • The difference on my side is that I do not store the model repo object or the metrics, as I feel they are best managed by clearml-serving and the serving engine respectively. I was trying to have a single source of truth and avoid duplicating info that may cause sync issues downstream.
    • I was also thinking about the unique ID design for identifying the serving engine instances, so that I can differentiate or aggregate serving engine metrics on the serving instance end. Any thoughts on this yet?


bmartinn commented Feb 8, 2022

I'm hoping that late next week I will be able to push a new dev branch with new code to play around with.

... What's the thinking behind routing inference requests through clearml-serving?

Good point on latency, and obviously this is by choice. The feedback we received (and feel free to add more) is that pre/post Python callbacks are really necessary for a lot of use cases. We opted for this design because it allows users to very easily add pre/post Python functions and still use serving engines for the model inference heavy lifting. A good example of a preprocessing function: if the input is a URL string, the preprocessing would download the data from the URL, encode it (i.e. load it into a numpy array) and pass it to the serving engine to run the model inference itself. A postprocessing example would be converting an encoded result into a human-readable string representation.
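
As a concrete (hypothetical) illustration of that example, the pre/post processing code for an image-classification endpoint could look roughly like this; the function names, input size and label map are assumptions, not the actual interface:

```python
# Hypothetical pre/post processing for an image-classification endpoint.
# Function names, input size and label map are illustrative, not the actual interface.
from io import BytesIO

import numpy as np
import requests
from PIL import Image

LABELS = {0: "cat", 1: "dog"}  # made-up label map

def preprocess(url: str) -> np.ndarray:
    """Download the image behind the URL and encode it as a numpy array."""
    img = Image.open(BytesIO(requests.get(url, timeout=10).content)).convert("RGB")
    img = img.resize((224, 224))
    return np.asarray(img, dtype=np.float32)[None, ...] / 255.0  # batch of one, NHWC

def postprocess(model_output: np.ndarray) -> str:
    """Convert the encoded model result into a human-readable string."""
    return LABELS[int(np.argmax(model_output))]
```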

Are you thinking of allowing 3rd-party extensions of clearml-serving to cover all aspects of model serving, such as pre/post processing, model validation, and resource monitoring?

Yes, this is exactly what we have in mind. Specifically, in the diagram the "3rd party plugin" would be an integration with model drift / anomaly detection, either running on the same machine or sending data to an external service.
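
In other words, the plugin boundary could be as thin as a callback per inference; a hypothetical shape (not an actual API) might be:

```python
# Hypothetical shape of a 3rd-party monitoring plugin hook (not an actual API).
# The serving instance would call the plugin with every request/response pair;
# the plugin analyses it locally or ships it to an external service.
from abc import ABC, abstractmethod
from typing import Any, Dict

class MonitoringPlugin(ABC):
    @abstractmethod
    def on_inference(self, endpoint: str, request: Dict[str, Any],
                     response: Dict[str, Any]) -> None:
        ...

class DriftDetectorStub(MonitoringPlugin):
    def on_inference(self, endpoint, request, response):
        # A real plugin would update drift statistics or POST to an external service
        print(f"[{endpoint}] recorded request/response pair for drift analysis")
```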

I did some of this work too, so I'm wondering how I can gel with the parallel work done by the ClearML team.

😍

A/B/Canary/Mirrored. ... kubernetes ingress ...

Yes, that would be the easiest out-of-the-box solution; I think the major drawback is the ability to easily configure it in real time.
My thinking is to have the ability to configure clearml-serving externally while it is running. Since the current design routes all requests through the clearml-serving instance, it should not be very complicated. wdyt? If this is achievable (and configurable) with a k8s ingress, I think I would prefer the k8s ingress for that. Thoughts?

I did part of the clearml-sidecar; I embedded it in the serving engine container and spun up a subprocess that monitors changes in the clearml-serving state and updates the published models periodically.

Feel free to post a link to a git repo / snippet :)

The difference on my side is that I do not store the model repo object or the metrics ...

I might have failed to illustrate it in the diagram: the idea is not to store another copy, just to reference the Model entity in clearml-server. The idea is to expand the Model entity so we can store more information on it (right now it is limited to URL/configuration/key-val); I think it makes sense to store performance metrics and links to pre/post code on the model itself. In the beginning, though, I think we will store it all on the "clearml-serving" Task, which will also serve as a UID for the clearml-serving instance.

I was also thinking about the unique ID design for identifying the serving engine instances ...

Right now we use additional Tasks to do that: every time the "sidecar" spins up, it creates a new Task that identifies the serving engine itself (and thus we can store metrics on the serving engine instance's performance). wdyt?
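
Roughly, the sidecar side of that could look like the sketch below (project/task names and the reported scalars are placeholders; the real sidecar would pull these numbers from the serving engine's stats API):

```python
# Sketch: a sidecar creating a Task that identifies its serving engine instance
# and reporting engine metrics onto it. Names and metric values are placeholders.
import time
from clearml import Task

# One Task per serving engine instance; its ID doubles as the instance UID
task = Task.init(project_name="clearml-serving", task_name="serving-engine-instance")
logger = task.get_logger()

for iteration in range(3):
    latency_ms, requests_per_sec = 12.5, 240.0  # would come from the engine stats
    logger.report_scalar("latency", "p50_ms", value=latency_ms, iteration=iteration)
    logger.report_scalar("throughput", "req_per_sec", value=requests_per_sec, iteration=iteration)
    time.sleep(1)
```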


bmartinn commented Feb 8, 2022

Expanding the original post
Suggested CLI:

ClearML Serving - launch the model serving control plane, to orchestrate all model serving

positional arguments:
  {display,config,model}
                        sub-command help
    display             Display information on the currently running clearml-serving instance

    config              Configure `clearml-serving` instance
              --packages [PACKAGES [PACKAGES ...]]     List of additional packages and versions needed by the model serving pre/post processing code

    model               Configure Model endpoints
              list             List current models
              remove           Remove model by its endpoint name
                        --endpoint ENDPOINT  model endpoint name
              add              Add/Update model
                        --endpoint ENDPOINT   Model endpoint (must be unique)
                        --id ID               Specify Static Model ID to be served
                        --name NAME           Specify Model Name/Tag/Project/Published to be selected and served
                        --tags TAGS           Specify Model Name/Tag/Project/Published to be selected and served
                        --project PROJECT     Specify Model Name/Tag/Project/Published to be selected and served
                        --published           Specify Model Name/Tag/Project/Published to be selected and served
                        --preprocess PREPROCESS Specify Pre/Post processing code to be used with the model (point to local file)
                        --input_size INPUT_SIZE [INPUT_SIZE ...] Specify the model matrix input size [Rows x Columns X Channels etc ...]
                        --input_type {float32,float16,uint8} Specify the model matrix input type
                        --output_size OUTPUT_SIZE [OUTPUT_SIZE ...] Specify the model matrix output size [Rows x Columns X Channels etc ...]
                        --output_type {float32,float16,uint8} Specify the model matrix output type
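
For illustration, endpoint management with this CLI could then look something like the following (assuming the entry point is named clearml-serving; the endpoint name, model UID and sizes are placeholders):

  clearml-serving model add --endpoint classifier --id <model-uid> \
      --preprocess preprocess.py --input_size 224 224 3 --input_type float32 \
      --output_size 2 --output_type float32
  clearml-serving model list
  clearml-serving model remove --endpoint classifier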


okyspace commented Feb 9, 2022

@bmartinn here's the work I am doing, which focuses on the "clearml-sidecar". I have termed it "triton-proxy" as I am focusing on Triton as the serving engine: https://github.com/okyspace/clearml-serving/tree/triton_proxy/clearml_serving
Not sure if it makes sense to you. Thanks for the info, I will digest it.


bmartinn commented Mar 7, 2022

@okyspace Quick update, things are finally wrapping up, here is the latest branch:
https://github.com/allegroai/clearml-serving/tree/dev

Next main step is getting the statistics merged as well :)


cccs-is commented May 24, 2022

@bmartinn Thanks for the new design for ClearML serving.
For some reason I can't locate any configuration for deploying on K8s, although "Deployment support for Kubernetes" was mentioned.
Do you happen to have a YAML/Helm chart available? I don't mind an unofficial one until the proper version is provided in the repo.

@jkhenning
Member


cccs-is commented May 25, 2022

Perfect! Thanks @jkhenning! Much appreciated.
