
Improved strategy for dealing with deterministically flaky tests which are order sensitive #125239

Open
ezyang opened this issue Apr 30, 2024 · 2 comments
Labels
module: ci (Related to continuous integration) · module: tests (Issues related to tests, not the torch.testing module) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Apr 30, 2024

🐛 Describe the bug

We have a pretty big flaky test problem in PyTorch CI (at the time of writing, there are 389 open disable issues for flaky tests). Based on work done by @zou3519 et al., we have determined that a big class of these flaky tests are due to ordering problems: that is, the test deterministically passes 100% of the time when run by itself, and only fails when certain other tests run before it. We also suspect that test reordering (e.g., due to target determination) makes these tests flaky. Many flaky tests report a 50% fail rate, but folks are consistently unable to reproduce them.

If it is true that a test only fails when run with some other tests in a particular order, then it should be simple to reproduce problems locally. The bare minimum we need is:

  1. CI must tell us exactly what order tests were run in on the run that failed
  2. We need some way of running tests in exactly the same order locally

We technically have both of these pieces today. Specifically, CI logs which tests it is executing in each shard, and using the pytest plugin bundled with https://github.com/asottile/detect-test-pollution you can force tests to execute in a particular order. The rest of the problem is user education and UI.
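To illustrate piece (2), here is a minimal conftest.py sketch, not the detect-test-pollution plugin itself; the `--test-order-file` option name is invented for this example. It forces pytest to run only the listed tests, in exactly the listed order:

```python
# conftest.py -- illustrative sketch, not PyTorch's actual test infrastructure.
# Replays a test order captured from a CI log.


def pytest_addoption(parser):
    # "--test-order-file" is a made-up option name for this example.
    parser.addoption(
        "--test-order-file",
        default=None,
        help="File with one pytest node ID per line, in the order CI ran them",
    )


def pytest_collection_modifyitems(config, items):
    order_file = config.getoption("--test-order-file")
    if not order_file:
        return
    with open(order_file) as f:
        wanted = [line.strip() for line in f if line.strip()]
    position = {nodeid: index for index, nodeid in enumerate(wanted)}
    # Keep only the tests CI ran, sorted into exactly the same order; drop everything else.
    selected = [item for item in items if item.nodeid in position]
    selected.sort(key=lambda item: position[item.nodeid])
    items[:] = selected
```

You would then run something like `pytest --test-order-file ci_order.txt test/`, where ci_order.txt holds the node IDs copied out of the failing CI shard's log.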

To give an example, #119747 is a flaky test which folks have investigated, but which doesn't reproduce when you run it by itself. It does reproduce when you run exactly the same set of tests, in the same order, as CI did.

Here is an idealized workflow I am imagining:

  1. CI detects that a test is flaky, because it fails sometimes and passes other times.
  2. Some offline infrastructure checks whether the test is deterministically flaky based on test order. Specifically, it needs to rerun the entire test shard exactly as it was run the first time and check whether the test fails again. (This is expensive.)
  3. Once the test has been determined to be deterministically flaky, the offline infrastructure can use a tool like detect-test-pollution to bisect down to a minimal set of tests that have to be run to repro the problem (see the bisection sketch after this list).
  4. We file a nice issue with clear instructions for how to reproduce the deterministic problem
  5. Developers debug and fix issues
  6. Profit!
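To make step 3 concrete, here is a rough prefix-bisection sketch (not how detect-test-pollution is actually implemented); the helper names are made up, and treating any nonzero pytest exit code as a repro is a simplification:

```python
# Hypothetical bisection over the tests that ran before the flaky target.
import subprocess
import sys


def target_fails(preceding, target):
    """Run `preceding` node IDs followed by `target` in a fresh pytest process."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", *preceding, target],
        capture_output=True,
    )
    # Simplification: a nonzero exit could also mean one of the preceding tests failed.
    return proc.returncode != 0


def bisect_polluters(preceding, target):
    """Shrink the preceding tests to a smaller set that still makes `target` fail."""
    assert target_fails(preceding, target), "does not repro with the full prefix"
    while len(preceding) > 1:
        mid = len(preceding) // 2
        first, second = preceding[:mid], preceding[mid:]
        if target_fails(first, target):
            preceding = first
        elif target_fails(second, target):
            preceding = second
        else:
            # The failure needs tests from both halves; stop with what we have.
            break
    return preceding
```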

We don't have to implement all of the ideal workflow, but making it easier for people to answer (1) does the test fail deterministically given a test order, and (2) what exact order do I need to run things in, seems particularly important. Bisecting down to the minimum set of tests that needs to be run is also relatively time consuming (on the order of hours), so an offline process that can backfill this would be useful as well.
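As a rough way to answer (2), a small script could recover the execution order from a verbose pytest log; this assumes plain `pytest -v` style result lines, which may not match PyTorch CI's actual log format:

```python
# Sketch: pull pytest node IDs out of a verbose log, preserving first-seen order.
import re
import sys

# Matches lines like "test/test_ops.py::TestOps::test_add PASSED [  3%]"
RESULT_LINE = re.compile(r"^(\S+::\S+)\s+(PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)")


def extract_order(log_path):
    order, seen = [], set()
    with open(log_path, errors="replace") as f:
        for line in f:
            match = RESULT_LINE.match(line.strip())
            if match and match.group(1) not in seen:
                seen.add(match.group(1))
                order.append(match.group(1))
    return order


if __name__ == "__main__":
    for nodeid in extract_order(sys.argv[1]):
        print(nodeid)
```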

Versions

main

cc @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry @ZainRizvi

@huydhn
Contributor

huydhn commented Apr 30, 2024

Here are some thoughts I have:

  * When a test fails, we rerun it in a new process starting from the failed test case. So, unless the test depends on some global state loaded from disk, I assume that it would pass on rerun. This is a strong indicator of a deterministically flaky test due to test order. Maybe we don't need the expensive second step. I guess we could optimistically assume that it's test order if, when we rerun the test in a new process several times, it doesn't fail again.
  * We have infra to rerun only the disabled tests daily to see if they are still flaky. It looks like dynamo is covered: hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rerun_disabled_tests. If an issue like #119747 isn't closed automatically, it means that it's still failing flakily there. This is where we could extract the test order for reproducibility, I think.

cc @clee2000

@ezyang
Contributor Author

ezyang commented Apr 30, 2024

Not only that, but it's important that the rerun disabled job MUST run the tests in context, because otherwise it might pass (but only because the culprit test that nuked the state didn't run before it), and then later we notice it's flaky again.
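A minimal sketch of the "pass on isolated rerun" heuristic described above, assuming ordinary pytest node IDs and an arbitrarily chosen rerun count of 5 (the script and threshold are illustrative, not PyTorch's actual retry infra):

```python
# Hypothetical classifier: a test that fails in-shard but passes several isolated
# reruns in fresh processes is a good candidate for "deterministically order-dependent".
import subprocess
import sys


def passes_in_isolation(test_id, reruns=5):
    """Return True if `test_id` passes `reruns` consecutive runs, each in a fresh process."""
    for _ in range(reruns):
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_id],
            capture_output=True,
        )
        if proc.returncode != 0:
            return False  # fails even alone: flaky in isolation, not (only) order-dependent
    return True


if __name__ == "__main__":
    test_id = sys.argv[1]  # e.g. test/test_foo.py::TestFoo::test_bar
    verdict = "likely order-dependent" if passes_in_isolation(test_id) else "flaky in isolation"
    print(f"{test_id}: {verdict}")
```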

@cpuhrsch added the module: ci, module: tests, and triaged labels Apr 30, 2024