
Vram Limit #135

Open
johncadengo opened this issue Mar 29, 2023 · 1 comment
Labels
type/feature Issue or PR related to a new feature

Comments

@johncadengo

Describe the feature you'd like to request

Hi, I'm running into an issue where I cannot load the large model. I have two GPUs, each with 8 GB of VRAM. Is it possible to split the work between the two GPUs?

If not, is it possible to disable the large model from the UX side and default to another size, like medium?

Describe the solution you'd like

No response
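
The fallback part of the request could also be handled with a VRAM check at request time rather than a UX toggle. Below is a minimal sketch (not part of WAAS; pick_model is a hypothetical helper) that picks the largest Whisper model fitting on the first GPU. It assumes PyTorch, which Whisper already depends on, and uses the approximate per-model VRAM figures listed in the Whisper README.

import torch

# Approximate VRAM needed per model size, in GiB (figures from the Whisper README).
VRAM_REQUIRED_GIB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(requested: str = "large", device_index: int = 0) -> str:
    # Hypothetical helper: return the requested size if it fits, else fall back.
    if not torch.cuda.is_available():
        return requested  # CPU-only: leave the decision to Whisper
    total_gib = torch.cuda.get_device_properties(device_index).total_memory / 2**30
    if VRAM_REQUIRED_GIB.get(requested, 0) <= total_gib:
        return requested
    # Fall back to the largest size that fits on this GPU.
    for name in ("medium", "small", "base", "tiny"):
        if VRAM_REQUIRED_GIB[name] <= total_gib:
            return name
    return "tiny"

With an 8 GB card, pick_model("large") would fall back to "medium", since the large model needs roughly 10 GiB.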

@johncadengo added the type/feature label on Mar 29, 2023
@johncadengo (Author)

OK, I found something of a solution in the Whisper repo: openai/whisper#360 (comment)

I was able to alter transcriber.py to accommodate this (original file: https://github.com/schibsted/WAAS/blob/main/src/transcriber.py):

from typing import Any

import whisper
from sentry_sdk import set_user

def transcribe(filename: str, requestedModel: str, task: str, language: str, email: str, webhook_id: str) -> dict[str, Any]:
    # The email is not used here, but it makes it easier for the worker to log who requested the job
    print("Executing transcription of " + filename + " for " + (email or webhook_id) + " using the " + requestedModel + " model")
    set_user({"email": email})

    # Hack: https://github.com/openai/whisper/discussions/360#discussioncomment-4156445
    # Load the model onto the CPU first, then split it across the two GPUs:
    # the encoder goes to the first card and the decoder to the second.
    model = whisper.load_model(requestedModel, device="cpu")

    model.encoder.to("cuda:0")
    model.decoder.to("cuda:1")

    # Move the decoder's inputs to cuda:1 before each forward pass, and move its
    # output back to cuda:0 so the rest of the pipeline stays on the first GPU.
    model.decoder.register_forward_pre_hook(lambda _, inputs: tuple([inputs[0].to("cuda:1"), inputs[1].to("cuda:1")] + list(inputs[2:])))
    model.decoder.register_forward_hook(lambda _, inputs, outputs: outputs.to("cuda:0"))

    return model.transcribe(filename, language=language, task=task)

And I just had to make sure that Docker exposed both GPUs to the container.
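
For anyone reproducing this, a quick sanity check run inside the container confirms that Docker actually passed both GPUs through (for plain docker run that usually means something like --gpus all). This is just a sketch; it assumes PyTorch is importable in the container, which it is as a Whisper dependency.

import torch

# The encoder/decoder split above needs two visible CUDA devices.
assert torch.cuda.is_available(), "CUDA is not visible inside the container"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
assert torch.cuda.device_count() >= 2, "expected two GPUs for the cuda:0 / cuda:1 split"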

It might be interesting to have this officially supported, but I'm offering my solution here for anyone who's interested.
