Model caching #3

Open

nstogner opened this issue Oct 31, 2023 · 5 comments

nstogner commented Oct 31, 2023

LLMs can be very large, so pulling model weights on every Pod startup is slow and expensive.

Possible caching implementations:

  1. Peer-to-peer
  • Models pulled from other backends
  • What about when replicas = 0?
  • Perhaps not the best use of GPU nodes
  2. Lingo as pull-through proxy
  • Lingo could serve an HTTP endpoint that acts as a pull-through cache for models (rough sketch below)
  • Might cause performance issues for regular traffic served by the same Pod (should it be a standalone Deployment?)
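
A very rough sketch of what option 2 could look like, assuming the proxy keys cached files by request path and writes them to a local directory — `cacheDir`, the upstream URL, and the port are illustrative, not existing Lingo code:

```go
// Hypothetical pull-through cache: serve a model file from local disk if
// present, otherwise fetch it from the upstream registry and store it.
package main

import (
    "io"
    "log"
    "net/http"
    "os"
    "path/filepath"
)

const (
    cacheDir = "/var/cache/models"      // illustrative; e.g. a local-SSD mount
    upstream = "https://huggingface.co" // illustrative upstream registry
)

func handler(w http.ResponseWriter, r *http.Request) {
    local := filepath.Join(cacheDir, filepath.Clean(r.URL.Path))

    // Cache hit: serve straight from disk.
    if _, err := os.Stat(local); err == nil {
        http.ServeFile(w, r, local)
        return
    }

    // Cache miss: pull from upstream.
    resp, err := http.Get(upstream + r.URL.Path)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        http.Error(w, "upstream returned "+resp.Status, http.StatusBadGateway)
        return
    }

    // Stream the response to the client and to the cache at the same time.
    if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    f, err := os.Create(local)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()
    if _, err := io.Copy(io.MultiWriter(w, f), resp.Body); err != nil {
        log.Printf("copy failed (partial cache file left behind): %v", err)
    }
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A real version would also need to handle auth headers, range requests, and concurrent fetches of the same file, but this is the general shape.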

samos123 commented Nov 5, 2023

The downside of a pull-through proxy is that it would require loading a self-signed cert on the model servers. I guess we could inject that cert automatically through a ConfigMap mount, so it's not a dealbreaker. I will see if I can do a PoC for it.
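
For reference, a minimal sketch of the client side, assuming the proxy's CA cert is mounted (e.g. from a ConfigMap) at a path like `/etc/lingo/proxy-ca.crt` — the path, proxy env var wiring, and model URL are all illustrative:

```go
// Hypothetical model-pulling client that trusts the proxy's self-signed CA
// and honors HTTPS_PROXY from the environment.
package main

import (
    "crypto/tls"
    "crypto/x509"
    "log"
    "net/http"
    "os"
)

func main() {
    // Illustrative path where a ConfigMap-mounted CA cert would land.
    caPEM, err := os.ReadFile("/etc/lingo/proxy-ca.crt")
    if err != nil {
        log.Fatal(err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        log.Fatal("failed to parse proxy CA certificate")
    }

    client := &http.Client{
        Transport: &http.Transport{
            Proxy:           http.ProxyFromEnvironment,  // picks up HTTPS_PROXY
            TLSClientConfig: &tls.Config{RootCAs: pool}, // trust the proxy's cert
        },
    }

    // Illustrative model URL; real servers would pull their configured weights.
    resp, err := client.Get("https://example.com/models/model.safetensors")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}
```

(The actual model servers are mostly Python, so in practice this would likely just be `HTTPS_PROXY` plus something like `REQUESTS_CA_BUNDLE` pointing at the mounted cert, but the mechanics are the same.)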

nstogner added this to the 0.2 Release milestone Nov 8, 2023

alpe commented Dec 15, 2023

IMHO this problem seems worth having its own solution rather than building it into lingo.
A solution can be provided external to k8s (e.g. proxies), at the node level (p2p, proxy), or even by a volume provider.
I had boosted some container startups by providing PVs from disk snapshots within GKE, for example.

@samos123

I built out an example in the vllm helm chart that uses a GKE ReadOnlyMany PV so there is no need to download any model: https://github.com/substratusai/helm/tree/main/charts/vllm#mistral-7b-instruct-on-gke-autopilot-with-readmanyonly-pvc-to-store-model

I'm looking to build a solution that works without too much hassle on any K8s distro. I agree that it could be beneficial outside of Lingo as well, especially if we go the path of a caching HTTPS proxy.

I would love to hear your feedback on a caching HTTPS proxy, where the model servers are configured to use the HTTPS proxy and the proxy simply caches the models on e.g. local SSD. That feels like the most generic solution that we could make work across any K8s provider.

alpe commented Dec 16, 2023

I think a caching proxy is low-hanging fruit that can easily be added to any environment where public models are used. Private models or fine-tuned models that are not accessible via http(s) do not benefit from this. Access control and retention are other topics to consider.
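
On retention, a naive sketch of what the proxy could do — assuming it simply ages out the least-recently-modified files once the cache directory exceeds a size budget; `cacheDir` and `maxCacheSize` are made-up values:

```go
package main

import (
    "log"
    "os"
    "path/filepath"
    "sort"
)

// Illustrative values; in practice these would come from flags or env.
const (
    cacheDir     = "/var/cache/models"
    maxCacheSize = 500 << 30 // 500 GiB
)

type entry struct {
    path string
    info os.FileInfo
}

// evict removes the oldest cached files until the cache directory
// fits under maxCacheSize.
func evict() error {
    var entries []entry
    var total int64
    err := filepath.Walk(cacheDir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        entries = append(entries, entry{path, info})
        total += info.Size()
        return nil
    })
    if err != nil {
        return err
    }

    // Oldest modification time first (a crude stand-in for "least recently used").
    sort.Slice(entries, func(i, j int) bool {
        return entries[i].info.ModTime().Before(entries[j].info.ModTime())
    })
    for _, e := range entries {
        if total <= maxCacheSize {
            break
        }
        if err := os.Remove(e.path); err != nil {
            return err
        }
        total -= e.info.Size()
    }
    return nil
}

func main() {
    if err := evict(); err != nil {
        log.Fatal(err)
    }
}
```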

A ReadOnlyMany PVC is a good idea to scale containers on a node easily. In the best case, model provisioning is handled automatically by the platform so that the container can focus on running the model only.

samos123 commented Jan 13, 2024

I think a private model registry would be a better fit for private or fine-tuned models. There doesn't seem to be a good open source project that works well as a standalone ML model registry. I found the MLflow model registry, but it's all baked together with the rest of MLflow: https://mlflow.org/docs/latest/model-registry.html

There is the Hugging Face model registry, but that's not open source either afaik.

An open source model registry might be something worth investing in.

The ReadOnlyMany PVC does work well as a cache for either public or private models.

@alpe did you form any updated opinions about this?

I'm on the fence about skipping the caching proxy and instead focusing time on a separate open source private model registry.

nstogner removed this from the 0.2 Release milestone Jan 16, 2024