
DISABLED test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU) #126493

Open
pytorch-bot bot opened this issue May 17, 2024 · 2 comments
Labels: module: dtensor (distributed tensor tag), module: flaky-tests (problem is a flaky test in CI), oncall: distributed (add this issue/PR to distributed oncall triage queue), skipped (denotes a (flaky) test currently skipped in CI)


pytorch-bot bot commented May 17, 2024

Platforms: linux

This test was disabled because it is failing in CI. See recent examples and the most recent trunk workflow logs.

Over the past 3 hours, it has been flagged as flaky in 7 workflows, with 21 failures and 7 successes.

Debugging instructions (after clicking on the recent samples link):
DO NOT ASSUME THINGS ARE OKAY IF THE CI IS GREEN. We now shield flaky tests from developers, so CI will be green even though the failures still appear in the logs, which makes them harder to find.
To find the relevant log snippets:

  1. Click on the workflow logs linked above.
  2. Click on the Test step of the job so that it is expanded; otherwise, grepping will not work.
  3. Grep for test_dtensor_op_db_dstack_cpu_float32 (a minimal sketch of this step follows the list).
  4. Several runs should appear (flaky tests are rerun in CI), and you can study their logs.
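
For step 3, here is a minimal Python sketch of the grep step, assuming you have saved the raw log of the Test step to a local file (the file name below is hypothetical, not part of any CI tooling):

    # Hypothetical helper: scan a locally saved copy of the Test step log
    # for the flaky test name and print the matching lines.
    test_name = "test_dtensor_op_db_dstack_cpu_float32"

    with open("test_step_log.txt") as log:  # assumed local download of the job log
        for lineno, line in enumerate(log, start=1):
            if test_name in line:
                print(f"{lineno}: {line.rstrip()}")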
Sample error message
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 985, in wrapper
    self._join_threads(self.threads, fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1106, in _join_threads
    cls._check_return_codes(failed_ranks, timeout, fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1143, in _check_return_codes
    raise RuntimeError(error_msg)
RuntimeError: Thread 0 exited with exception:
Traceback (most recent call last):
  File "distributed/_tensor/test_dtensor_ops.py", line 668, in run_dtensor_crossref
    self.assert_ref_dtensor_equal(dtensor_rs, rs)
  File "distributed/_tensor/test_dtensor_ops.py", line 601, in assert_ref_dtensor_equal
    self.assertEqualOnRank(dtensor_r, r)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1173, in assertEqualOnRank
    self.assertEqual(x, y, msg)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 3639, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 4 / 12 (33.3%)
Greatest absolute difference: 11.370566368103027 at index (0, 1, 1) (up to 1e-05 allowed)
Greatest relative difference: 1.8178166151046753 at index (0, 1, 1) (up to 1.3e-06 allowed)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1932, in wrapper
    fn(*args, **kwargs)
  File "distributed/_tensor/test_dtensor_ops.py", line 577, in test_dtensor_op_db
    self.check_dtensor_func(test, op)
  File "distributed/_tensor/test_dtensor_ops.py", line 683, in check_dtensor_func
    test_func()
  File "distributed/_tensor/test_dtensor_ops.py", line 570, in test
    self.run_dtensor_crossref(op.op, args, kwargs)
  File "distributed/_tensor/test_dtensor_ops.py", line 675, in run_dtensor_crossref
    raise RuntimeError(
RuntimeError: failed to run: torch.dstack, with (*[(tensor([ 8.3972,  4.3191, -0.8691, -0.4368], requires_grad=True), tensor([ 5.1155, -6.2551,  2.9920, -2.9822], requires_grad=True), tensor([ 5.2072, -3.2105,  0.4450,  3.0391], requires_grad=True))], **{})

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1069, in run_test_with_threaded_pg
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 987, in wrapper
    fn()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2756, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1361, in wrapper
    fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 977, in test_wrapper
    raise Exception(  # noqa: TRY002
Exception: Caused by sample input at index 0: SampleInput(input=TensorList[Tensor[size=(4,), device="cpu", dtype=torch.float32], Tensor[size=(4,), device="cpu", dtype=torch.float32], Tensor[size=(4,), device="cpu", dtype=torch.float32]], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
     python test/distributed/_tensor/test_dtensor_ops.py -k TestDTensorOpsCPU.test_dtensor_op_db_dstack_cpu_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Test file path: distributed/_tensor/test_dtensor_ops.py
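
For context, the op under test here is torch.dstack, which promotes each input to at least 3-D and concatenates along the third dimension. Below is a minimal eager-mode sketch of the sample input above (three float32 tensors of shape (4,)); it is illustrative only and not taken from this CI run:

    import torch

    # Three 1-D float32 tensors of shape (4,), mirroring the SampleInput above
    # (the values here are arbitrary, not the ones from the failing run).
    a = torch.randn(4, requires_grad=True)
    b = torch.randn(4, requires_grad=True)
    c = torch.randn(4, requires_grad=True)

    # dstack promotes each input to at least 3-D, i.e. (4,) -> (1, 4, 1),
    # then concatenates along dim 2, giving shape (1, 4, 3).
    out = torch.dstack((a, b, c))
    print(out.shape)  # torch.Size([1, 4, 3])

The crossref test runs the same op on DTensor inputs and compares the result against this eager output under the float32 tolerances shown above (relative up to 1.3e-06, absolute up to 1e-05); the reported mismatch of 4 out of 12 elements is what tripped the assertion.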

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @clee2000 @msaroufim

pytorch-bot bot added the module: flaky-tests, oncall: distributed, and skipped labels on May 17, 2024

pytorch-bot bot commented May 17, 2024

Hello there! From the DISABLED prefix in this issue title, it looks like you are attempting to disable a test in PyTorch CI. The information I have parsed is below:
  • Test name: test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU)
  • Platforms for which to skip the test: linux
  • Disabled by pytorch-bot[bot]

Within ~15 minutes, test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU) will be disabled in PyTorch CI for these platforms: linux. Please verify that your test name looks correct, e.g., test_cuda_assert_async (__main__.TestCuda).

To modify the platforms list, include a line in the issue body like the one below. If no platforms list is specified, the default action disables the test on all platforms.

Platforms: case-insensitive, list, of, platforms

We currently support the following platforms: asan, dynamo, inductor, linux, mac, macos, rocm, slow, win, windows.

yf225 added the module: dtensor label on May 20, 2024
yf225 (Contributor) commented May 20, 2024

@wanchaol Please feel free to reassign, thanks!
