
DISABLED test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU) #126493

Open
pytorch-bot bot opened this issue May 17, 2024 · 2 comments
Labels: module: dtensor (distributed tensor tag), module: flaky-tests (problem is a flaky test in CI), oncall: distributed (add this issue/PR to distributed oncall triage queue), skipped (denotes a (flaky) test currently skipped in CI)


pytorch-bot bot commented May 17, 2024

Platforms: linux

This test was disabled because it is failing in CI. See recent examples and the most recent trunk workflow logs.

Over the past 3 hours, it has been flagged as flaky in 7 workflows, with 21 failures and 7 successes.

Debugging instructions (after clicking on the recent samples link):
DO NOT ASSUME THINGS ARE OKAY IF THE CI IS GREEN. We now shield flaky tests from developers, so CI will be green even though the failures still appear in the logs, which makes them harder to find.
To find the relevant log snippets:

  1. Click on the workflow logs linked above.
  2. Click on the Test step of the job so that it is expanded; otherwise, grepping will not work.
  3. Grep for test_dtensor_op_db_dstack_cpu_float32 (a minimal sketch of this step follows the list).
  4. Several runs should appear (flaky tests are rerun in CI), and you can study their logs.
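
For step 3, here is a minimal Python sketch of the grep step, assuming you have saved the raw log of the Test step to a local file (the file name below is hypothetical, not part of any CI tooling):

    # Hypothetical helper: scan a locally saved copy of the Test step log
    # for the flaky test name and print the matching lines.
    test_name = "test_dtensor_op_db_dstack_cpu_float32"

    with open("test_step_log.txt") as log:  # assumed local download of the job log
        for lineno, line in enumerate(log, start=1):
            if test_name in line:
                print(f"{lineno}: {line.rstrip()}")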
Sample error message
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 985, in wrapper
    self._join_threads(self.threads, fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1106, in _join_threads
    cls._check_return_codes(failed_ranks, timeout, fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1143, in _check_return_codes
    raise RuntimeError(error_msg)
RuntimeError: Thread 0 exited with exception:
Traceback (most recent call last):
  File "distributed/_tensor/test_dtensor_ops.py", line 668, in run_dtensor_crossref
    self.assert_ref_dtensor_equal(dtensor_rs, rs)
  File "distributed/_tensor/test_dtensor_ops.py", line 601, in assert_ref_dtensor_equal
    self.assertEqualOnRank(dtensor_r, r)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1173, in assertEqualOnRank
    self.assertEqual(x, y, msg)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 3639, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 4 / 12 (33.3%)
Greatest absolute difference: 11.370566368103027 at index (0, 1, 1) (up to 1e-05 allowed)
Greatest relative difference: 1.8178166151046753 at index (0, 1, 1) (up to 1.3e-06 allowed)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1932, in wrapper
    fn(*args, **kwargs)
  File "distributed/_tensor/test_dtensor_ops.py", line 577, in test_dtensor_op_db
    self.check_dtensor_func(test, op)
  File "distributed/_tensor/test_dtensor_ops.py", line 683, in check_dtensor_func
    test_func()
  File "distributed/_tensor/test_dtensor_ops.py", line 570, in test
    self.run_dtensor_crossref(op.op, args, kwargs)
  File "distributed/_tensor/test_dtensor_ops.py", line 675, in run_dtensor_crossref
    raise RuntimeError(
RuntimeError: failed to run: torch.dstack, with (*[(tensor([ 8.3972,  4.3191, -0.8691, -0.4368], requires_grad=True), tensor([ 5.1155, -6.2551,  2.9920, -2.9822], requires_grad=True), tensor([ 5.2072, -3.2105,  0.4450,  3.0391], requires_grad=True))], **{})

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 1069, in run_test_with_threaded_pg
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 987, in wrapper
    fn()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2756, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1361, in wrapper
    fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 977, in test_wrapper
    raise Exception(  # noqa: TRY002
Exception: Caused by sample input at index 0: SampleInput(input=TensorList[Tensor[size=(4,), device="cpu", dtype=torch.float32], Tensor[size=(4,), device="cpu", dtype=torch.float32], Tensor[size=(4,), device="cpu", dtype=torch.float32]], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
     python test/distributed/_tensor/test_dtensor_ops.py -k TestDTensorOpsCPU.test_dtensor_op_db_dstack_cpu_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Test file path: distributed/_tensor/test_dtensor_ops.py
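
For context, the op under test here is torch.dstack, which promotes each input to at least 3-D and concatenates along the third dimension. Below is a minimal eager-mode sketch of the sample input above (three float32 tensors of shape (4,)); it is illustrative only and not taken from this CI run:

    import torch

    # Three 1-D float32 tensors of shape (4,), mirroring the SampleInput above
    # (the values here are arbitrary, not the ones from the failing run).
    a = torch.randn(4, requires_grad=True)
    b = torch.randn(4, requires_grad=True)
    c = torch.randn(4, requires_grad=True)

    # dstack promotes each input to at least 3-D, i.e. (4,) -> (1, 4, 1),
    # then concatenates along dim 2, giving shape (1, 4, 3).
    out = torch.dstack((a, b, c))
    print(out.shape)  # torch.Size([1, 4, 3])

The crossref test runs the same op on DTensor inputs and compares the result against this eager output under the float32 tolerances shown above (relative up to 1.3e-06, absolute up to 1e-05); the reported mismatch of 4 out of 12 elements is what tripped the assertion.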

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @clee2000 @msaroufim

pytorch-bot bot added the module: flaky-tests, oncall: distributed, and skipped labels on May 17, 2024

pytorch-bot bot commented May 17, 2024

Hello there! From the DISABLED prefix in this issue title, it looks like you are attempting to disable a test in PyTorch CI. The information I have parsed is below:
  • Test name: test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU)
  • Platforms for which to skip the test: linux
  • Disabled by pytorch-bot[bot]

Within ~15 minutes, test_dtensor_op_db_dstack_cpu_float32 (__main__.TestDTensorOpsCPU) will be disabled in PyTorch CI for these platforms: linux. Please verify that your test name looks correct, e.g., test_cuda_assert_async (__main__.TestCuda).

To modify the platforms list, include a line in the issue body like the one below. If no platforms list is specified, the default action disables the test on all platforms.

Platforms: case-insensitive, list, of, platforms

We currently support the following platforms: asan, dynamo, inductor, linux, mac, macos, rocm, slow, win, windows.

yf225 added the module: dtensor label on May 20, 2024
yf225 (Contributor) commented May 20, 2024

@wanchaol Please feel free to reassign, thanks!
