
feat: add local inference services #40

Open
FelixNgFender opened this issue Apr 4, 2024 · 8 comments

@FelixNgFender
Owner

https://www.acorn.io/resources/blog/introducing-cog-and-containerizing-machine-learning-models

@MengLinMaker

Hi there, cool project.

I happen to be working on something kinda similar (piano-to-MIDI transcription): Musidi

I suggest taking a look at parallel processing for audio-related AI inference. With this approach, compute is no longer the bottleneck; bandwidth and cold starts become the new bottlenecks.
This could massively speed up your transcription.
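Roughly, the idea looks like this (a minimal sketch; `transcribe_chunk` is a placeholder for whatever model you end up running):

```python
# Minimal sketch: split an audio file into fixed-length chunks and run
# inference on the chunks in parallel. transcribe_chunk is a placeholder.
import concurrent.futures

import soundfile as sf


def split_audio(path, chunk_seconds=30.0):
    """Read an audio file and slice it into fixed-length chunks of samples."""
    audio, sr = sf.read(path)
    step = int(chunk_seconds * sr)
    return [audio[i:i + step] for i in range(0, len(audio), step)], sr


def transcribe_chunk(chunk, sr):
    """Placeholder: run your transcription model on a single chunk here."""
    return []


def transcribe_parallel(path, max_workers=8):
    chunks, sr = split_audio(path)
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map preserves chunk order, so the partial results can be merged back.
        return list(pool.map(transcribe_chunk, chunks, [sr] * len(chunks)))
```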

I don't know which model you chose, but the accuracy isn't great.
BTW, I'm randomly scouring the internet for people working on similar stuff.

@FelixNgFender
Owner Author

Hi, I really appreciate the suggestion. Could you tell me more about this parallel processing idea? In the current setup, I send my processing tasks off to Replicate (a GPU/model-provider SaaS), which kicks off the processing and immediately returns a task ID (non-blocking, asynchronous style). AFAIK, there is no limit to how many tasks I can send to Replicate at a time (there is a rate limit, albeit a very high one). So I'm not too sure how to approach parallel processing when moving to local services. BTW, I'm intending to use Cog (Replicate's open-source model-as-a-container tool) to package the local models.
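For context, the current flow looks roughly like this (sketch with the `replicate` Python client; the version hash is a placeholder):

```python
# Sketch of the current flow: submit a prediction to Replicate and get back
# a task/prediction ID immediately, then poll or use a webhook for results.
import replicate

client = replicate.Client()  # reads REPLICATE_API_TOKEN from the environment

prediction = client.predictions.create(
    version="MODEL_VERSION_HASH",                 # placeholder version hash
    input={"audio": open("track.mp3", "rb")},
)
print(prediction.id, prediction.status)           # returns right away, non-blocking
```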

I'm using Basic Pitch by Spotify. I think the accuracy isn't great because it's trained as a general-purpose model. I will try experimenting with the parameters to come up with "profiles" for different instruments.
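Something like this is what I have in mind for the profiles (rough sketch; the thresholds and frequency ranges are illustrative guesses, not tuned values):

```python
# Rough sketch of a per-instrument "profile": the same Basic Pitch call with
# different thresholds/frequency ranges. Values here are guesses, not tuned.
from basic_pitch.inference import predict

model_output, midi_data, note_events = predict(
    "guitar_take.wav",         # illustrative input file
    onset_threshold=0.6,       # stricter onsets -> fewer spurious notes
    frame_threshold=0.3,
    minimum_frequency=80.0,    # roughly guitar range, as an example
    maximum_frequency=1200.0,
)
midi_data.write("guitar_take.mid")
```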

@MengLinMaker

I deployed a modified version of Bytedance's Piano Transcriber to AWS Lambda, using CPU inference in parallel. Surprisingly, it's much faster than deploying to Replicate.

According to Replicate, typical inference would take 6 minutes for this model (audio size unknown). AWS Lambda brings it down to under 30 seconds for an 8-minute audio file.

However, AWS Lambda has low-bandwidth issues. Ideally, I'm trying to find a way to combine both GPU and parallel inference.
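A fan-out like that looks roughly like this (simplified sketch with `boto3`; the function name and payload shape are only illustrative):

```python
# Simplified fan-out: invoke one Lambda per audio chunk concurrently, then
# collect the partial transcriptions in order. Names/payloads are illustrative.
import concurrent.futures
import json

import boto3

lambda_client = boto3.client("lambda")


def transcribe_chunk(chunk_url):
    response = lambda_client.invoke(
        FunctionName="piano-transcriber",                    # illustrative name
        Payload=json.dumps({"chunk_url": chunk_url}).encode(),
    )
    return json.loads(response["Payload"].read())


chunk_urls = [f"s3://my-bucket/audio/chunk-{i}.wav" for i in range(16)]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    partials = list(pool.map(transcribe_chunk, chunk_urls))
# partials preserve chunk order, so note events can be time-offset and merged.
```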

@FelixNgFender
Owner Author

Are you chopping the audio into parts, sending them off for processing, and aggregating the results in a final Lambda? That sounds interesting.

Replicate has issues with cold starts (up to 3-4 minutes in some rare cases) for less popular models, so I'm looking at other options. My vision is to make Mu2Mi self-hostable, so all the models and other components should fit on a single host.

For the bandwidth issues, perhaps you can check out scaling out with Modal. They are a PaaS with a "serverless GPU" niche. They provide some nice Python primitives to build out your ML workflows and let you mix and match CPU/GPU for any part of the workflow. I have never used them, so these are just points I read on their page. Ray Serve is an open-source alternative to that; that one is really nice in my experience.
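From their docs, the primitives look roughly like this (untested sketch, since I haven't actually used Modal; names and bodies are illustrative only):

```python
# Untested sketch of Modal's primitives for mixing GPU and CPU steps,
# based on their docs; names and bodies are illustrative only.
import modal

app = modal.App("transcriber-sketch")
image = modal.Image.debian_slim().pip_install("numpy")


@app.function(image=image, gpu="T4")       # GPU-backed inference step
def transcribe_chunk(chunk_index: int) -> str:
    # Real code would load a model and run inference on one audio chunk.
    return f"notes-for-chunk-{chunk_index}"


@app.function(image=image)                 # CPU-only aggregation step
def merge(partials: list) -> str:
    return "\n".join(partials)


@app.local_entrypoint()
def main():
    # Fan out GPU inference over chunks, then merge the results on CPU.
    partials = list(transcribe_chunk.map(range(8)))
    print(merge.remote(partials))
```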

@MengLinMaker

Yeah, Modal seems very promising. The deployment process seems a little cumbersome (Modal's CEO doesn't like containers). It's on my to-do list.

Self-hostable is an interesting vision. The challenge is that optimising AI models is very hardware-specific. I know that ONNX Runtime does some runtime optimisations by checking the host's hardware info. How do you plan to make it self-hostable?
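For reference, the hardware-specific bit mostly shows up as execution-provider selection (small sketch; the model path is illustrative):

```python
# ONNX Runtime adapts to the host: you list execution providers in priority
# order and it falls back to whatever the machine actually supports.
import onnxruntime as ort

print(ort.get_available_providers())       # differs per machine/build

session = ort.InferenceSession(
    "transcriber.onnx",                    # illustrative model path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```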

Also, interestingly, Spotify's Basic Pitch demo is more accurate than Mu2Mi for some reason. Maybe there's some pitch configuration? I notice some of the lower-pitched notes are being omitted.

@MengLinMaker

Oh, it seems like Ray has security issues; I'll have to look into it more though.

@FelixNgFender
Owner Author

I plan to make it self-hostable through Docker Compose, since Cog can generate a Dockerfile for a pre-defined model. It irons out the kinks of using GPU/CUDA across platforms. One concern I have is the size of the generated Docker containers. One large model may take up to 20GB; multiply that by 6 and it's a whole 120GB SSD lol. I think Docker has some caching mechanism for similar build image layers though, but I still need to look into that.

> Oh, it seems like Ray has security issues; I'll have to look into it more though.

Oh, I didn't know that lol. I'll reconsider Ray then.

As for the Basic Pitch demo, I'll take a look at the parameters of Spotify's demo. I'm currently quite busy at school until May, so that may have to wait a little bit.

@MengLinMaker

20GB container sounds absurdly big. How big is your largest model?

I believe the Docker cache mainly applies to docker builds. Multi-stage Docker builds will then share the same base image layers when building.

The largest container I had was just over 2GB: the model was 150MB plus PyTorch and Librosa. Eventually I quantised the model to 75MB in float16 and used ONNX Runtime, which fits into the 250MB limit of AWS Lambda (in a Docker container, that would be around 1.25GB).

Of course, I didn't include any GPU libraries though.
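For reference, a float16 conversion like that can be done with `onnxconverter-common` (sketch; file names are illustrative):

```python
# Sketch of a float16 conversion with onnxconverter-common; file names are
# illustrative. Halving the weight precision roughly halves the model size.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("transcriber_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32)
onnx.save(model_fp16, "transcriber_fp16.onnx")
```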
