Model caching #3

Open

nstogner opened this issue Oct 31, 2023 · 5 comments

nstogner commented Oct 31, 2023

LLMs can be very large, so pulling model weights on every Pod startup is slow and expensive.

Possible caching implementations:

  1. Peer-to-peer
  • Models pulled from other backends
  • What about when replicas = 0?
  • Perhaps not the best use of GPU nodes
  2. Lingo as pull-through proxy
  • Lingo could serve an HTTP endpoint that acts as a pull-through cache for models (rough sketch below)
  • Might cause performance issues for regular traffic served by the same Pod (should it be a standalone Deployment?)
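
A very rough sketch of what option 2 could look like, assuming the proxy keys cached files by request path and writes them to a local directory — `cacheDir`, the upstream URL, and the port are illustrative, not existing Lingo code:

```go
// Hypothetical pull-through cache: serve a model file from local disk if
// present, otherwise fetch it from the upstream registry and store it.
package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

const (
    cacheDir = "/var/cache/models"      // illustrative; e.g. a local-SSD mount
    upstream = "https://huggingface.co" // illustrative upstream registry
)

func handler(w http.ResponseWriter, r *http.Request) {
    local := filepath.Join(cacheDir, filepath.Clean(r.URL.Path))

    // Cache hit: serve straight from disk.
    if _, err := os.Stat(local); err == nil {
        http.ServeFile(w, r, local)
        return
    }

    // Cache miss: pull from upstream.
    resp, err := http.Get(upstream + r.URL.Path)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        http.Error(w, "upstream returned "+resp.Status, http.StatusBadGateway)
        return
    }

    // Stream the response to the client and to the cache at the same time.
    if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    f, err := os.Create(local)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()
    if _, err := io.Copy(io.MultiWriter(w, f), resp.Body); err != nil {
        log.Printf("copy failed (partial cache file left behind): %v", err)
    }
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A real version would also need to handle auth headers, range requests, and concurrent fetches of the same file, but this is the general shape.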

samos123 commented Nov 5, 2023

The downside of a pull-through proxy is that it would require loading a self-signed cert on the model servers. I guess we could inject that cert automatically through a ConfigMap mount, so it's not a dealbreaker. I will see if I can do a PoC for it.
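
For reference, a minimal sketch of the client side, assuming the proxy's CA cert is mounted (e.g. from a ConfigMap) at a path like `/etc/lingo/proxy-ca.crt` — the path, proxy env var wiring, and model URL are all illustrative:

```go
// Hypothetical model-pulling client that trusts the proxy's self-signed CA
// and honors HTTPS_PROXY from the environment.
package main

import (
    "crypto/tls"
    "crypto/x509"
    "log"
    "net/http"
    "os"
)

func main() {
    // Illustrative path where a ConfigMap-mounted CA cert would land.
    caPEM, err := os.ReadFile("/etc/lingo/proxy-ca.crt")
    if err != nil {
        log.Fatal(err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        log.Fatal("failed to parse proxy CA certificate")
    }

    client := &http.Client{
        Transport: &http.Transport{
            Proxy:           http.ProxyFromEnvironment,  // picks up HTTPS_PROXY
            TLSClientConfig: &tls.Config{RootCAs: pool}, // trust the proxy's cert
        },
    }

    // Illustrative model URL; real servers would pull their configured weights.
    resp, err := client.Get("https://example.com/models/model.safetensors")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}
```

(The actual model servers are mostly Python, so in practice this would likely just be `HTTPS_PROXY` plus something like `REQUESTS_CA_BUNDLE` pointing at the mounted cert, but the mechanics are the same.)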

nstogner added this to the 0.2 Release milestone Nov 8, 2023

alpe commented Dec 15, 2023

IMHO this problem seems worth having its own solution rather than building it into lingo.
A solution can be provided external to k8s (e.g. proxies), at the node level (p2p, proxy), or even by a volume provider.
I had boosted some container startups by providing PVs from disk snapshots within GKE, for example.

@samos123

I built out an example in the vllm helm chart that uses a GKE ReadOnlyMany PV so there is no need to download any model: https://github.com/substratusai/helm/tree/main/charts/vllm#mistral-7b-instruct-on-gke-autopilot-with-readmanyonly-pvc-to-store-model

I'm looking to build a solution that works without too much hassle on any K8s distro. I agree that it could be beneficial outside of Lingo as well, especially if we go the path of a caching HTTPS proxy.

I would love to hear your feedback on a caching HTTPS proxy, where the model servers are configured to use the HTTPS proxy and the proxy simply caches the models on e.g. local SSD. That feels like the most generic solution that we could make work across any K8s provider.

alpe commented Dec 16, 2023

I think a caching proxy is low-hanging fruit that can easily be added to any environment where public models are used. Private models or fine-tuned models that are not accessible via http(s) do not benefit from this. Access control and retention are other topics to consider.
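
On retention, a naive sketch of what the proxy could do — assuming it simply ages out the least-recently-modified files once the cache directory exceeds a size budget; `cacheDir` and `maxCacheSize` are made-up values:

```go
package main

import (
    "log"
    "os"
    "path/filepath"
    "sort"
)

// Illustrative values; in practice these would come from flags or env.
const (
    cacheDir     = "/var/cache/models"
    maxCacheSize = 500 << 30 // 500 GiB
)

type entry struct {
    path string
    info os.FileInfo
}

// evict removes the oldest cached files until the cache directory
// fits under maxCacheSize.
func evict() error {
    var entries []entry
    var total int64
    err := filepath.Walk(cacheDir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        entries = append(entries, entry{path, info})
        total += info.Size()
        return nil
    })
    if err != nil {
        return err
    }

    // Oldest modification time first (a crude stand-in for "least recently used").
    sort.Slice(entries, func(i, j int) bool {
        return entries[i].info.ModTime().Before(entries[j].info.ModTime())
    })
    for _, e := range entries {
        if total <= maxCacheSize {
            break
        }
        if err := os.Remove(e.path); err != nil {
            return err
        }
        total -= e.info.Size()
    }
    return nil
}

func main() {
    if err := evict(); err != nil {
        log.Fatal(err)
    }
}
```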

A ReadOnlyMany PVC is a good idea to scale containers on a node easily. In the best case, model provisioning is handled automatically by the platform so that the container can focus on running the model only.

samos123 commented Jan 13, 2024

I think a private model registry would be a better fit for private or fine-tuned models. There doesn't seem to be a good open source project that works well as a standalone ML model registry. I found the MLflow model registry, but it's all baked together with the rest of MLflow: https://mlflow.org/docs/latest/model-registry.html

There is the Hugging Face model registry, but that's not open source either afaik.

An open source model registry might be something worth investing in.

The ReadOnlyMany PVC does work well as a cache for either public or private models.

@alpe did you form any updated opinions about this?

I'm on the fence about skipping the caching proxy and instead focusing time on a separate open source private model registry.

nstogner removed this from the 0.2 Release milestone Jan 16, 2024