[Bug]: multi GPU crashes backend #359
Comments
If #265 fixes this issue, let me know, and then I will be happy to test it out.
The error log isn't very helpful. It may give you more info if you kill the server (async moment). It could be due to an internal timeout, but it's hard to tell from this error.
This looks like a timeout error. Did you have a sequence that took longer than 60 seconds to process? As a hotfix, you can increase the timeout threshold with export APHRODITE_ENGINE_ITERATION_TIMEOUT_S=120, which sets the limit to 120 seconds. You can also pass it as an environment variable to the Docker image.
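As a rough sketch of the hotfix above, the variable can be exported in the shell or forwarded into a container with Docker's `-e` flag. The image name below is a placeholder (an assumption, not taken from this thread), and the command is only printed rather than executed, so adapt it to your own setup:

```shell
# Raise the engine iteration timeout from the 60-second limit mentioned above to 120 seconds.
export APHRODITE_ENGINE_ITERATION_TIMEOUT_S=120

# IMAGE is a hypothetical placeholder; substitute the image you actually run.
IMAGE="your-aphrodite-image"

# Forward the variable into the container with -e and expose all GPUs with --gpus all.
DOCKER_ARGS="--gpus all -e APHRODITE_ENGINE_ITERATION_TIMEOUT_S=${APHRODITE_ENGINE_ITERATION_TIMEOUT_S}"

# Print the resulting command instead of running it, so the sketch is safe to paste.
echo "docker run ${DOCKER_ARGS} ${IMAGE}"
```

Raising the limit only hides the symptom if the load is genuinely stuck on one GPU, so it is worth checking GPU utilization while a request is in flight.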
Currently running on AWS. My setup:
It crashes on the 60-second timeout, meaning it hogs a single GPU instead of distributing the load.
Normal response time for 1 GPU with 1 client is around 6 seconds.
Your current environment
🐛 Describe the bug
When I set NUM_GPUS to 8 (the server has 8 GPUs), I get the following error (sorry, but the system hates to properly log errors):

The annoying part is that the server is not stopped, and the "health" check still shows 200 (which should not be the case, since the backend crashed).