PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb still fails with too many errors after #20156 #20222

Open
akesandgren opened this issue Mar 27, 2024 · 5 comments

Comments

@akesandgren
Contributor

@Flamefire Any ideas here?

I still get 3 errors on my builds.

A40:
WARNING: 3 test failures, 0 test errors (out of 211116):
test_jit 1/1 (1 failed, 2380 passed, 114 skipped, 12 xfailed, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2078 passed, 613 skipped, 80 xfailed, 2 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)

V100:
WARNING: 3 test failures, 0 test errors (out of 210847):
inductor/test_compiled_autograd 1/1 (1 failed, 130 passed, 114 skipped, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2078 passed, 613 skipped, 80 xfailed, 2 rerun)
test_nn 1/1 (1 failed, 2556 passed, 109 skipped, 3 xfailed, 2 rerun)

A100:
WARNING: 3 test failures, 0 test errors (out of 211120):
test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
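For comparing summaries like the ones above across machines, here is a minimal Python sketch that totals the `failed` counts from the per-suite lines and checks them against a limit. The names `failed_per_suite` and `exceeds_limit` are made up for illustration, and the line format is inferred only from the output pasted in this issue; this is not how EasyBuild itself parses the test log.

```python
import re

# Matches per-suite summary lines as pasted in this issue, e.g.:
#   test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
# The format is an assumption based on that output only.
SUITE_RE = re.compile(r"^(?P<suite>\S+) \d+/\d+ \((?P<counts>[^)]*)\)$")


def failed_per_suite(log_text):
    """Return {suite_name: failed_count} for every suite summary line."""
    result = {}
    for line in log_text.splitlines():
        match = SUITE_RE.match(line.strip())
        if not match:
            continue  # skip headers such as "A100:" or the WARNING line
        failed = 0
        for item in match.group("counts").split(","):
            number, label = item.split()
            if label == "failed":
                failed = int(number)
        result[match.group("suite")] = failed
    return result


def exceeds_limit(log_text, max_failed):
    """True if the total number of failed tests is above max_failed."""
    return sum(failed_per_suite(log_text).values()) > max_failed


# The A100 summary from this issue: 2 + 1 = 3 failed tests in total.
a100_log = """\
test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
"""

print(failed_per_suite(a100_log))
print(exceeds_limit(a100_log, max_failed=2))
```

The limit is passed in by the caller, so the same summary can be checked against a strict and a lenient threshold.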

@lexming
Contributor

lexming commented Mar 27, 2024

I would not say that 3 is too many. We allow for 50 failures in the easyconfig because we know that many tests are unreliable. Are those 3 failed tests very important?

@akesandgren
Contributor Author

akesandgren commented Mar 27, 2024

No clue, I'm more wondering why everybody else's test builds passed with only 2 errors...

And we shouldn't have an EC in a release that fails to build. So we either have to increase allowed failures or fix some of the above.

@lexming
Contributor

lexming commented Mar 27, 2024

I see, this is related to the recent changes from #20156.
It seems too aggressive to have lowered the number of allowed failed tests down to 2.

@Flamefire
Contributor

Also from the confcall:

  • test_proxy_tensor likely requires rebuilding Z3, as the non-suffixed version now includes the Python bindings.
  • test_nn is known to fail randomly
  • test_jit might be an issue with the memory (leak) detector on some systems
  • the other 2 don't happen for me, no idea here without more details

2 might indeed be too low; it was mainly intended to shake out issues in that PR, where it seemingly worked well enough.
The idea was to keep this as low as possible while still working for most, so that when more tests fail a conscious decision can be made to use --ignore-test-failure or --skip-test-step.

@akesandgren
Contributor Author

The extra V100/A40 problems were due to not having rebuilt one of the dependencies.
