Summary of tests in machine-readable format #126523
I can move the option to save XML into an argument. The paths to the XMLs are generated by run_test.py though, is that OK?
You mean this code here: `torch/testing/_internal/common_utils.py`, lines 968 to 978 in 7aa853a?
To me it looks like all XML files go into the folder passed via `torch/testing/_internal/common_utils.py`, line 1100 in 7aa853a.
`--save-xml=/tmp/test-reports/report` could be used, and then all XMLs end up in `/tmp/test-reports`.
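Aggregating the reports from such a folder could then be a matter of walking the directory and counting testcase elements. A minimal sketch, assuming standard JUnit XML output (`<testsuite>`/`<testcase>` with `<failure>`/`<error>`/`<skipped>` children); the exact schema and file layout PyTorch's runners emit may differ:

```python
import glob
import os
import xml.etree.ElementTree as ET

def summarize_xml_reports(report_dir):
    """Aggregate test counts from all JUnit XML files under report_dir.

    Walks the directory recursively, so it does not matter how the
    reports are nested or whether file names are randomized.
    """
    total = failed = skipped = 0
    failed_names = []
    for path in glob.glob(os.path.join(report_dir, "**", "*.xml"), recursive=True):
        root = ET.parse(path).getroot()
        for case in root.iter("testcase"):
            total += 1
            # A testcase counts as failed if it carries a <failure> or <error> child.
            if case.find("failure") is not None or case.find("error") is not None:
                failed += 1
                failed_names.append(f"{case.get('classname')}.{case.get('name')}")
            elif case.find("skipped") is not None:
                skipped += 1
    return total, failed, skipped, failed_names
```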
The main question is: is that even a feasible approach? I.e., does this work to get all failed tests, especially with the retry logic, where I'm not sure whether it will skip the "consistently failing" test and still try all others (before and after that failing one)? As those XMLs don't seem to be used (otherwise it would have already been noticed that they got broken), is there any other/better approach?
Have you tried using `--save-xml` as an additional unittest arg? Our CI makes XMLs and saves them to S3; you can download them on HUD to see what format and file paths they take. There are already a couple of scripts in the repository that parse the XMLs to extract certain information, so you can probably copy a bunch of code from those.
It doesn't get all failed tests; there are some cases where XMLs don't get made (#123882), but I think the log-parsing method you used previously would also not have been able to handle those. It will handle most retries correctly, though you will have to dedup based on test name.
I guess it's also worth noting that you could parse the logs, since all the test names should be there if you use the verbose option?
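The dedup step mentioned above could look like the following sketch. The "a test that passes on a retry counts as passed" policy is an assumption for illustration, not necessarily what PyTorch's CI scripts do:

```python
def dedup_reruns(results):
    """Collapse reruns: for each test name, keep the last recorded status.

    `results` is an ordered list of (test_name, status) tuples collected
    from all XML files, in rerun order. Later attempts override earlier
    ones, so a flaky test that eventually passes counts as passed;
    counting any failure at all would be the alternative policy.
    """
    final = {}
    for name, status in results:
        final[name] = status
    failed = sorted(n for n, s in final.items() if s == "failed")
    return final, failed
```

For example, a test that failed once and then passed on rerun would not appear in the failed list, while a consistently failing test would.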
Where does that happen, i.e. where are the XMLs created?
There seem to be two issues: force-terminated tests and reruns. I'd expect the XML files to also contain the rerun tests even currently, as the file names get randomized.
IIRC when
We actually do that already, to be able to print individual failed tests in addition to the summaries (per test file and in total). It even allowed us to validate that we got all individual failed tests by checking against the summary. But that already turned out to be error-prone due to the different formats. Some examples:
The save-XML argument is set by default based on the CI environment variable, which is true in our CI. I guess the other option would be for you to set it as well, but it also turns on a couple of other things, so you'd have to be careful.
I'm not sure what a force-terminated test is, but the case where no XML gets generated is when a test segfaults, which leads to the XML not getting created and also the `=== x failed, y skipped ===` line not getting printed. Then the rerun starts from the segfaulted test, so all information about the tests before the segfaulted one gets lost. I can't remember what the previous behavior was, but depending on how far back it was, I'm pretty sure the previous test information still gets lost. Other than this, the reruns should get XML; it's just that it'll be in a different file, so it's a bit more parsing.
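One way to at least detect such lost runs is to cross-check which test files produced no XML report at all. A sketch, assuming (hypothetically) that each test file's reports land under a subdirectory named after the file's stem; the actual layout in PyTorch's report folder may differ:

```python
import glob
import os

def find_files_without_reports(test_files, report_dir):
    """Flag test files that produced no XML at all (e.g. after a segfault).

    A test file whose report subtree contains no XML most likely died
    before any report was emitted, so its per-test results were lost
    and must be recovered some other way (e.g. from the log).
    """
    missing = []
    for tf in test_files:
        stem = os.path.splitext(os.path.basename(tf))[0]
        # Assumed layout: <report_dir>/.../<stem>/.../*.xml
        pattern = os.path.join(report_dir, "**", stem, "**", "*.xml")
        if not glob.glob(pattern, recursive=True):
            missing.append(tf)
    return missing
```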
Is that run_test.py setting `--junit-xml-reruns` (lines 993 to 997 in b36e018)? The same flag is set in common_utils.py only when `--save-xml` is passed (https://github.com/pytorch/pytorch/blob/b36e01801b89a516f4271f796773d5f4b43f1186/torch/testing/_internal/common_utils.py#L1108-L1111), and the former only applies to tests run with pytest. So it looks incomplete to me.
Yes, I meant test (files) terminated by a signal such as SIGSEGV, but also SIGIOT (which we have observed).
Yes, it was the same here: most of the information was lost (e.g. that summary line). Some individual tests could be found by parsing the log, but that is fragile; see my previous comment.
Just to be sure I understood this correctly: for segfaulted tests there won't be any XMLs, even with the new rerun system?
🐛 Describe the bug
We are running the test suite
python run_test.py --continue-through-error
after installing PyTorch on our (many) clusters. For that we need to know how many tests were run, succeeded, and failed, to assess whether the installation is working at all, and especially in that environment. Many of the issues I reported and PRs I submitted are based on that. Until recently we were able to parse the build output containing summary lines such as
This is easy to grep for and even to extract manually. However, recently (with the 2.3.0 release) we get something like the following (I'll file a separate bug report for that failure once I know more; the specific failure is only an example here):
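The grep-and-extract approach for such summary lines can be sketched with a small regex. This assumes a pytest-style line such as `=== 2 failed, 10 passed, 1 skipped in 3.21s ===`; the exact categories and decoration vary between runs, so each count is captured by its keyword rather than by position:

```python
import re

# A run of '=' signs framing a body that mentions at least one result category.
SUMMARY_RE = re.compile(r"=+ (?P<body>[^=]*?(?:failed|passed|skipped|error)[^=]*?) =+")
# Individual "<count> <category>" pairs inside the body.
COUNT_RE = re.compile(r"(\d+) (\w+)")

def parse_summary(line):
    """Return {category: count} for a summary line, or None if absent."""
    m = SUMMARY_RE.search(line)
    if not m:
        return None
    return {word: int(n) for n, word in COUNT_RE.findall(m.group("body"))}
```

As the report notes, this only works while the runner actually prints such a line; a segfaulted or force-terminated test file never reaches it.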
This is mostly due to 3b7d60b, which removed the following and replaced the whole logic with a custom pytest plugin that reruns differently (basically: manually rerun the suite, continuing after the first failure until the next one, then repeat):
So the output can no longer be reasonably parsed to check for the number of failures or the number of tests.
Additionally, as tests are run in parallel, partial outputs from different tests appear interleaved, making it virtually impossible to attribute a given piece of output to a specific test.
I have seen the option `--save-xml`, which seems to be intended for such purposes, but it isn't passed by run_test.py to the tests spawned by it. Hence I'm reporting this here as a bug instead of only a feature request.
The main question: is there any reliable way to extract the number of failed and run tests after running run_test.py?
Versions
PyTorch 2.2.0-2.3.1 and main as of now
cc @mruberry @ZainRizvi