Potentially Misbehaving GPU #10660
The GPU has 6 GB, but the OOM happens at a few hundred MB. Something is off; maybe reboot? Your screenshot also shows 5 GB taken by something else.
As soon as XLA allocates, it goes to 5 GB upfront.
I'll reboot and try again. I thought something was off yes. Thanks!
@cheshire confirmed: after a reboot, same symptoms. As soon as I start a notebook with XLA-related code, the GPU reserves 5406 MB of memory. I'm aware this might very well not be XLA-related... but given the knowledge I've seen around here, I'd still like to ask :)
An interesting bit here...
@cheshire OK, so I rebooted and tried again. I'll put a screenshot here because the output is from Livebook and I can't copy-paste it... What's really weird is the upfront allocation of this particular size. Do you think the drivers on the host are misbehaving?
Well, actually, it seems to be behaving slightly differently now...
Trying text summarization with BART.
Yes, that's expected: the BFC allocator grabs all the memory upfront. But then it's failing to allocate a few hundred MB, so something is interfering. Try to run with
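If the upfront grab by the BFC allocator is itself the problem, EXLA's client options can tame it. A minimal sketch, assuming an EXLA version that supports the `:preallocate` and `:memory_fraction` client options (check your EXLA docs before relying on these names):

```elixir
# config/config.exs — ask the CUDA client not to reserve (almost)
# all GPU memory upfront. These option names are from EXLA's client
# configuration; verify them against the version you have installed.
import Config

config :exla,
  clients: [
    cuda: [
      platform: :cuda,
      # cap the reservation at half the GPU's memory...
      memory_fraction: 0.5,
      # ...and allocate on demand instead of all at once
      preallocate: false
    ]
  ],
  default_client: :cuda
```

With `preallocate: false`, tools like `nvidia-smi` should show usage growing as tensors are actually allocated, which makes it easier to see what else is competing for the 6 GB.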
OK, so I'm trying now.
Still crashes, but it's no longer crashing my terminal and killing some of my apps. It just fails somewhere, and my supervision tree kicks in and restarts it. The logs are too long to attach here... @cheshire could you, off the top of your head, recommend a model that would be "known to work" with this kind of GPU? For example, a small one. Thanks a lot!
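While waiting for a recommendation, one way to test with a deliberately small model is something like the following sketch. The model name (`distilbert-base-uncased`), task, and options here are illustrative assumptions, not a configuration confirmed to work on this GPU:

```elixir
# Smoke test with a small model; names and options are illustrative.
Nx.global_default_backend({EXLA.Backend, client: :cuda})

{:ok, model_info} = Bumblebee.load_model({:hf, "distilbert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased"})

serving =
  Bumblebee.Text.fill_mask(model_info, tokenizer,
    # small, fixed shapes keep the compiled program's memory modest
    compile: [batch_size: 1, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, "The capital of France is [MASK].")
```

If a model this small still OOMs, the problem is almost certainly the environment rather than model sizing.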
This issue might stem from elixir-nx/xla#80, where the lower-level setup isn't right.
It seems to me that something is misbehaving. Following advice I received, I thought I could see how Livebook behaves on the GPU. I configured the container spec for NVIDIA along with Docker, and am running something like this:
I'm trying to use "out of the box" Smart Cells that involve Nx / Bumblebee / XLA.
But as soon as I try to run one, the GPU goes OOM, even though it seems it should be able to take much more than that (it has around 6 GB of memory).
I've tried various options, e.g. the model backend `backend: {EXLA.Backend, client: :host}` along with different combinations of `defn_options: [compiler: EXLA, lazy_transfers: :always]`, but it just seems to postpone the crash (to when the inference runs instead of when the model loads). The error looks like this:
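For reference, the combination described above can be expressed roughly as follows. This is a sketch of the options being tried, not a confirmed fix; the serving setup around it is assumed:

```elixir
# Keep model parameters on the host (CPU) and transfer them to the
# GPU lazily, operation by operation, to lower peak GPU memory at
# the cost of transfer overhead during inference.
Nx.global_default_backend({EXLA.Backend, client: :host})

serving_options = [
  defn_options: [compiler: EXLA, lazy_transfers: :always]
]
```

With this split, the OOM moves from model-load time to inference time, which matches the "postpones the crash" behavior described above.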
The GPU in this laptop is as follows (definitely not a desktop GPU, but still high-end for a laptop):
Is there any way for me to reliably verify my setup, i.e. confirm that the GPU is indeed undersized, or see whether something is inherently wrong with the lower-level setup? For example, a Livebook with parameters known to work within the spec of this GPU?
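One minimal sanity check, assuming EXLA is installed with CUDA support, is to run a tiny allocation and computation directly on the GPU backend before involving any model:

```elixir
# Tiny end-to-end GPU check: if this small allocation and sum
# succeed, the driver/runtime path works at all, and an OOM that
# appears later is about sizing, not a fundamentally broken setup.
t = Nx.iota({1000}, type: :f32, backend: {EXLA.Backend, client: :cuda})
IO.inspect(Nx.sum(t))
```

If even this fails, the problem is below Nx/Bumblebee, in the CUDA driver, container runtime, or XLA build, which is where elixir-nx/xla#80 points.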