
Improved strategy for dealing with deterministically flaky tests which are order sensitive #125239

Open
ezyang opened this issue Apr 30, 2024 · 2 comments
Labels
module: ci (Related to continuous integration) · module: tests (Issues related to tests, not the torch.testing module) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Apr 30, 2024

🐛 Describe the bug

We have a pretty big flaky test problem in PyTorch CI (at the time of writing, there are 389 open disable issues for flaky tests). Based on work done by @zou3519 et al., we have determined that a big class of these flaky tests are due to ordering problems: that is, the test deterministically passes 100% of the time when run by itself, and only fails when certain other tests run before it. We also suspect that test reordering (e.g., due to target determination) makes these tests flaky. Many flaky tests report a 50% fail rate, but folks are consistently unable to reproduce them.

If it is true that a test only fails when run with some other tests in a particular order, then it should be simple to reproduce problems locally. The bare minimum we need is:

  1. CI must tell us exactly what order tests were run in on the run that failed
  2. We need some way of running tests in exactly the same order locally

We technically have both of these pieces today. Specifically, CI logs which tests it is executing in each shard, and using the pytest plugin bundled with https://github.com/asottile/detect-test-pollution you can force tests to execute in a particular order. The rest of the problem is user education and UI.
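To illustrate piece (2), here is a minimal conftest.py sketch, not the detect-test-pollution plugin itself; the `--test-order-file` option name is invented for this example. It forces pytest to run only the listed tests, in exactly the listed order:

```python
# conftest.py -- illustrative sketch, not PyTorch's actual test infrastructure.
# Replays a test order captured from a CI log.


def pytest_addoption(parser):
    # "--test-order-file" is a made-up option name for this example.
    parser.addoption(
        "--test-order-file",
        default=None,
        help="File with one pytest node ID per line, in the order CI ran them",
    )


def pytest_collection_modifyitems(config, items):
    order_file = config.getoption("--test-order-file")
    if not order_file:
        return
    with open(order_file) as f:
        wanted = [line.strip() for line in f if line.strip()]
    position = {nodeid: index for index, nodeid in enumerate(wanted)}
    # Keep only the tests CI ran, sorted into exactly the same order; drop everything else.
    selected = [item for item in items if item.nodeid in position]
    selected.sort(key=lambda item: position[item.nodeid])
    items[:] = selected
```

You would then run something like `pytest --test-order-file ci_order.txt test/`, where ci_order.txt holds the node IDs copied out of the failing CI shard's log.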

To give an example, #119747 is a flaky test which folks have investigated, but which doesn't reproduce when you run it by itself. It does reproduce when you run exactly the same set of tests, in the same order, as CI did.

Here is an idealized workflow I am imagining:

  1. CI detects that a test is flaky, because it fails sometimes and passes other times.
  2. Some offline infrastructure checks whether the test is deterministically flaky based on test order. Specifically, it needs to rerun the entire test shard exactly as it was run the first time and check whether the test fails again. (This is expensive.)
  3. Once the test has been determined to be deterministically flaky, the offline infrastructure can use a tool like detect-test-pollution to bisect down to a minimal set of tests that have to be run to repro the problem (see the bisection sketch after this list).
  4. We file a nice issue with clear instructions for how to reproduce the deterministic problem
  5. Developers debug and fix issues
  6. Profit!
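To make step 3 concrete, here is a rough prefix-bisection sketch (not how detect-test-pollution is actually implemented); the helper names are made up, and treating any nonzero pytest exit code as a repro is a simplification:

```python
# Hypothetical bisection over the tests that ran before the flaky target.
import subprocess
import sys


def target_fails(preceding, target):
    """Run `preceding` node IDs followed by `target` in a fresh pytest process."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", *preceding, target],
        capture_output=True,
    )
    # Simplification: a nonzero exit could also mean one of the preceding tests failed.
    return proc.returncode != 0


def bisect_polluters(preceding, target):
    """Shrink the preceding tests to a smaller set that still makes `target` fail."""
    assert target_fails(preceding, target), "does not repro with the full prefix"
    while len(preceding) > 1:
        mid = len(preceding) // 2
        first, second = preceding[:mid], preceding[mid:]
        if target_fails(first, target):
            preceding = first
        elif target_fails(second, target):
            preceding = second
        else:
            # The failure needs tests from both halves; stop with what we have.
            break
    return preceding
```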

We don't have to implement all of the ideal workflow, but making it easier for people to answer (1) does the test fail deterministically given a test order, and (2) what exact order do I need to run things in, seems particularly important. Bisecting down to the minimum set of tests that needs to be run is also relatively time consuming (on the order of hours), so an offline process that can backfill this would be useful as well.
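As a rough way to answer (2), a small script could recover the execution order from a verbose pytest log; this assumes plain `pytest -v` style result lines, which may not match PyTorch CI's actual log format:

```python
# Sketch: pull pytest node IDs out of a verbose log, preserving first-seen order.
import re
import sys

# Matches lines like "test/test_ops.py::TestOps::test_add PASSED [  3%]"
RESULT_LINE = re.compile(r"^(\S+::\S+)\s+(PASSED|FAILED|ERROR|SKIPPED|XFAIL|XPASS)")


def extract_order(log_path):
    order, seen = [], set()
    with open(log_path, errors="replace") as f:
        for line in f:
            match = RESULT_LINE.match(line.strip())
            if match and match.group(1) not in seen:
                seen.add(match.group(1))
                order.append(match.group(1))
    return order


if __name__ == "__main__":
    for nodeid in extract_order(sys.argv[1]):
        print(nodeid)
```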

Versions

main

cc @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry @ZainRizvi

@huydhn
Contributor

huydhn commented Apr 30, 2024

Here are some thoughts I have:

  * When a test fails, we rerun it in a new process starting from the failed test case. So, unless the test depends on some global state loaded from disk, I assume that it would pass on rerun. This is a strong indicator of a deterministically flaky test due to test order. Maybe we don't need the expensive second step. I guess we could optimistically assume that it's test order if, when we rerun the test in a new process several times, it doesn't fail again.
  * We have infra to rerun only the disabled tests daily to see if they are still flaky. It looks like dynamo is covered: hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rerun_disabled_tests. If an issue like #119747 isn't closed automatically, it means that it's still failing flakily there. This is where we could extract the test order for reproducibility, I think.

cc @clee2000

@ezyang
Contributor Author

ezyang commented Apr 30, 2024

Not only that, but it's important that the rerun disabled job MUST run the tests in context, because otherwise it might pass (but only because the culprit test that nuked the state didn't run before it), and then later we notice it's flaky again.
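A minimal sketch of the "pass on isolated rerun" heuristic described above, assuming ordinary pytest node IDs and an arbitrarily chosen rerun count of 5 (the script and threshold are illustrative, not PyTorch's actual retry infra):

```python
# Hypothetical classifier: a test that fails in-shard but passes several isolated
# reruns in fresh processes is a good candidate for "deterministically order-dependent".
import subprocess
import sys


def passes_in_isolation(test_id, reruns=5):
    """Return True if `test_id` passes `reruns` consecutive runs, each in a fresh process."""
    for _ in range(reruns):
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_id],
            capture_output=True,
        )
        if proc.returncode != 0:
            return False  # fails even alone: flaky in isolation, not (only) order-dependent
    return True


if __name__ == "__main__":
    test_id = sys.argv[1]  # e.g. test/test_foo.py::TestFoo::test_bar
    verdict = "likely order-dependent" if passes_in_isolation(test_id) else "flaky in isolation"
    print(f"{test_id}: {verdict}")
```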

@cpuhrsch added the module: ci, module: tests, and triaged labels Apr 30, 2024