PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb still fails with too many errors after #20156 #20222

akesandgren · 2024-03-27T06:36:15Z

@Flamefire Any ideas here?

I still get 3 errors on my builds.

A40:
WARNING: 3 test failures, 0 test errors (out of 211116):
test_jit 1/1 (1 failed, 2380 passed, 114 skipped, 12 xfailed, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2078 passed, 613 skipped, 80 xfailed, 2 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)

V100:
WARNING: 3 test failures, 0 test errors (out of 210847):
inductor/test_compiled_autograd 1/1 (1 failed, 130 passed, 114 skipped, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2078 passed, 613 skipped, 80 xfailed, 2 rerun)
test_nn 1/1 (1 failed, 2556 passed, 109 skipped, 3 xfailed, 2 rerun)

A100:
WARNING: 3 test failures, 0 test errors (out of 211120):
test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)

lexming · 2024-03-27T08:28:08Z

I would not say that 3 is too many. We allow for 50 failures in the easyconfig because we know that many tests are unreliable. Are those 3 failed tests very important?

akesandgren · 2024-03-27T08:31:29Z

No clue, I'm more wondering why everybody else's test builds passed with only 2 errors...

And we shouldn't have an EC in a release that fails to build. So we either have to increase allowed failures or fix some of the above.

lexming · 2024-03-27T08:39:49Z

I see, this is related to the recent changes from #20156.
It seems too aggressive to have lowered the number of allowed failed tests to 2.

Flamefire · 2024-03-27T11:24:50Z

Also from the confcall:

test_proxy_tensor likely requires rebuilding Z3, as the non-suffixed version now includes the Python bindings.
test_nn is known to fail randomly.
test_jit might be an issue with the memory (leak) detector on some systems.
The other 2 don't happen for me; no idea here without more details.

2 might indeed be too low; it was mainly intended to shake out issues in that PR, where it seemingly worked well enough.
The idea was to keep this as low as possible while still working for most, so that when more tests fail a conscious decision can be made to use --ignore-test-failures or --skip-test-step
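
For illustration only, here is a minimal sketch of the kind of threshold check discussed above: it sums the per-suite "N failed" counts from a summary like the ones posted in this issue and compares the total against an allowed maximum. This is not EasyBuild's actual implementation. The function names, the regular expressions, and the `max_failed_tests` argument are assumptions made for the example, and the sample log is copied from the A100 report above.

```python
import re

# Per-suite summary lines look like:
#   "test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)"
# The pattern is an assumption based on the logs quoted in this issue.
SUITE_RE = re.compile(r"^(?P<suite>\S+) \d+/\d+ \((?P<counts>[^)]*)\)$")
FAILED_RE = re.compile(r"(\d+) failed")

# Sample taken from the A100 report in this thread.
A100_LOG = """\
WARNING: 3 test failures, 0 test errors (out of 211120):
test_optim 1/1 (2 failed, 182 passed, 2 skipped, 4 rerun)
test_nn 1/1 (1 failed, 2798 passed, 128 skipped, 3 xfailed, 2 rerun)
"""


def count_failures(log_text):
    """Return {suite: failed_count} for every suite that reports failures."""
    failures = {}
    for line in log_text.splitlines():
        match = SUITE_RE.match(line.strip())
        if not match:
            continue  # e.g. the "WARNING: ..." header line
        failed = FAILED_RE.search(match.group("counts"))
        if failed:
            failures[match.group("suite")] = int(failed.group(1))
    return failures


def exceeds_limit(log_text, max_failed_tests):
    """True if the total number of failed tests is above the allowed maximum."""
    return sum(count_failures(log_text).values()) > max_failed_tests


if __name__ == "__main__":
    print(count_failures(A100_LOG))
    print(exceeds_limit(A100_LOG, max_failed_tests=2))
    print(exceeds_limit(A100_LOG, max_failed_tests=50))
```

With a limit of 2 the A100 run counts as a failed build, while the previous limit of 50 would let it pass, which is the trade-off being debated here.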

akesandgren · 2024-04-02T11:25:42Z

The extra V100/A40 problems were due to not having rebuilt one of the dependencies.
