Reproducible hang with no user code when shutting down threads #3295
I really hope this isn't environmental. I was making steady progress removing things without losing the repro, but now I can't get it to repro anymore, even if I build older commits along the way that used to repro.
Okay, it's not that I'm losing the repro, thankfully. It's just getting rarer. I ran the last cycle overnight and I see this morning that it took an hour of back-to-back runs to reproduce.
Still removing minor parts of the remaining 39 tests, but it's hard to do much without losing the repro. Verifying whether I still have it takes anywhere from a few seconds to an hour at most. But an hour is a long time to wait for any feedback on what I just tried.
All my guesses at how to increase the likelihood of the hang have not panned out, either, but if @CharliePoole or @oznetmaster or anyone else has a theory about what might make the hang more likely, I'd be excited to try it.
Since the hang is in WorkItemQueue.Dequeue, I thought I'd start looking there. The three things I see are:
See the separate thread on why I think this is ready to be rewritten. 😉
Could these lines in the Dequeue loop get into stale reads on every iteration since they aren't using… (snippet: nunit/src/NUnitFramework/framework/Internal/Execution/WorkItemQueue.cs, lines 253 to 254 at a56e685)
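For illustration, here is a minimal sketch of the stale-read concern (this is not NUnit's actual code; all names are invented): without volatile semantics, the JIT is free to hoist a field read out of a spin loop, so the loop can keep seeing a stale value forever. `Volatile.Read`/`Volatile.Write` force each iteration to observe the value published by another thread.

```csharp
using System.Threading;

static class StaleReadDemo
{
    // Written by one thread (Stop) and polled by another (WaitForStop).
    static int _stopped;

    public static bool WaitForStop(int timeoutMs)
    {
        var sw = System.Diagnostics.Stopwatch.StartNew();
        while (sw.ElapsedMilliseconds < timeoutMs)
        {
            // Without the acquire fence, this read could legally be hoisted
            // out of the loop, spinning on a stale 0 indefinitely.
            if (Volatile.Read(ref _stopped) != 0)
                return true;
            Thread.Yield();
        }
        return false;
    }

    // Release-write so the flag is published to other threads.
    public static void Stop() => Volatile.Write(ref _stopped, 1);
}
```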
@CharliePoole I'm not sure I'm the best person to rewrite it, and I saw Rob's comment about starting from scratch being scary, which I tend to agree with. I like your feature switch suggestion. However, you were talking about rewriting the dispatcher. Would it be possible to rewrite the WorkItemQueue class without rewriting anything else?
Worth trying. However, I'm at the point of believing it's time to make sure we have adequate tests and start over, re-implementing the dispatcher.
Sure. That has the advantage of being fairly well tested already. You might want to add some more. IMO there is a lot of value in figuring out ways to do the scary stuff without harming production.
More info: I ran with an older commit of my repro with more tests, and the repro happened on the second run. This time only one parallel worker shut down, but all the rest are stuck waiting as usual. I took an internal trace and have a debugger attached. Both the parallel queue and the nonparallel STA queue are empty, have the state Running, and have addId and removeId equal (both 47 for the parallel queue and 77 for the STA queue). Three of the test workers have a work item named OneTimeTearDown in spite of the fact that I don't have any one-time teardowns, and the work item state is Ready. The remaining five (including the STA worker) have a work item with a test name and the state Complete. I can't read the internal trace logs until I kill the process.
My bad, wrong log. Anyway, out of four runs, the two that reproduced the hang have logs ending with:
For comparison, here is what the non-hanging traces look like at the exact same point:
Somehow, it's evident that three…
I'm not seeing that. There's only one TestWorker active in the code snippet that finishes. Thread 17, which is owned by the Parallel STA worker, runs two one-time teardowns; then thread 14 runs the one-time teardown for the last remaining fixture as well as for each enclosing namespace and the dll itself. Those one-time teardowns probably are not doing anything except flagging completion. Then thread 14 shuts down the shift, causing all the queues and workers to stop. Thread 17 is the last to stop. In your hung situation, I think something is happening in the one-time teardown for Views. Is there a SetUpFixture?
I just noticed while poking around another hang that the OneTimeTearDown work items (the ones that are Ready and never complete) all have CurrentApartment STA and target apartment Unknown, even though they are the…
@CharliePoole No SetUpFixtures, no one-time setups or teardowns. You've described what happens when there is no hang, but what about when there is a hang? When I attach a debugger, I see three test workers in a call to WorkItemQueue.Dequeue which can never return, each with a different OneTimeTearDown work item in the Ready state. Doesn't that mean that each of the three work items was put in each worker's…
I'll have to look at this when I can be in the code... a little later this evening.
Ah, in this repro commit I do have an ITestAction attribute applied to the assembly. I did remove the attribute from later commits which still reproed, but not as often.
If I use… Right after that, the final test ran and the repro happened. Only 7 of the 9 worker threads are still running. Maybe the repro is to throw AccessViolationException in a test method?
The self-executing test run is up to thousands of invocations of new FrameworkController(), LoadTests, and RunTests, with no repro. No exceptions either.
Looks like a bug in .NET Framework. (My machine has…
This intermittent access violation can be reproduced simply by putting… It is insanely timing-sensitive, though.
Heh, deleting files was working against me: the more source files are embedded in the test assembly's PDB, the slower… All the tests can be commented out, but more .cs files in the test project = a better chance of repro.
I have a tool that shows the repro rate over a hundred runs (running four NUnit Console processes at a time), so I can see whether a change I make to the project makes it repro better or worse. It's always 10% right now. I want to pull a lot of proprietary code out of the repro that I haven't been able to touch without losing the repro.
According to this tool, a 5 ms timing difference in the tests preceding the…
Sorry, I was going to take a look but other stuff happened. Going to sleep now.
I appreciate it! I think the cause is a .NET Framework bug that produces a poisoned exception which throws AccessViolationException any time its stack trace is printed or retrieved. This kills one of the test worker threads and writes directly to stderr. Once I get a sharable repro, I'll start looking for ways to harden NUnit against this .NET Framework bug.
Pushed a self-contained (if not yet completely minimal) repro to https://github.com/jnm2/NUnitHangRepro. 🎉 It repros on another machine (13% repro rate), also on .NET Framework 4.8, Windows 10 1903.
Removing… Another workaround might be a one-time setup in a SetUpFixture that purposely causes an exception in each assembly that has a PDB. It will run before all tests, so there is no chance of a race condition. The problem with this workaround is that it requires a lot of foresight when you have a lot of assemblies to test.
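A sketch of what that workaround might look like (hypothetical names; assumes the NUnit.Framework package is referenced and the fixture sits outside any namespace so it applies to the whole assembly): the one-time setup throws and catches an exception, then resolves a stack trace with file/line info so the racy portable-PDB path is exercised once, before any tests run in parallel.

```csharp
using System;
using System.Diagnostics;
using NUnit.Framework;

// Runs once before all tests in the assembly, while only one thread is active.
[SetUpFixture]
public class StackTraceWarmup
{
    [OneTimeSetUp]
    public void WarmUpPortablePdbReader()
    {
        try
        {
            throw new InvalidOperationException("stack trace warm-up");
        }
        catch (InvalidOperationException ex)
        {
            // fNeedFileInfo: true forces the PDB reader to resolve
            // file and line information for the trace.
            _ = new StackTrace(ex, true).ToString();
        }
    }
}
```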
Removing EmbedAllSources isn't good enough in general; the hang is still happening. Maybe removing it just shifts the timing around, or maybe NUnit.TestAdapter.pdb is interfering. I just recalled something I saw last year: there was a performance issue with this same thing, the portable PDB format, and there was a switch to disable the file and line info:

```xml
<runtime>
  <AppContextSwitchOverrides value="Switch.System.Diagnostics.IgnorePortablePDBsInStackTraces=true" />
</runtime>
```

Using the Windows PDB format might also work, though since the NUnit VSTest adapter shows up near the entry point in an exception stack trace, its portable PDB might still throw things.
The app.config switch didn't seem to have an effect. Last workaround I can think of: nuke the PDBs in the tests' build output. |
I made a mistake while verifying the repro. Nuking the PDBs out of all test assembly directories before running tests seems to be a good workaround for now. I will try the app.config switch again too. |
Just demonstrated again that the app.config workaround doesn't work. It's probably stymied by https://github.com/nunit/nunit-console/pull/443/files (fyi: @nunit/engine-team). The…
Current repro. This averages a 23% repro rate out of 1000 runs. To maximize the window of time in which the race condition is deadly, the project needs…

```csharp
using System.Diagnostics;
using System.Threading;
using NUnit.Framework;

[assembly: Parallelizable(ParallelScope.Children)]

// 6 or greater makes it more likely for A(35) and A(55) to start at the same time
[assembly: LevelOfParallelism(6)]

namespace HangRepro
{
    public static class ReproTests
    {
        [Test]
        public static void A([Values(35, 55)] int delay)
        {
            Thread.Sleep(delay);

            // Boom
            new StackTrace(true);
            // or e.g. any Shouldly method, since the first thing it does is grab a stack trace
        }

        [Test]
        public static void B([Range(1, 20)] int _)
        {
            Thread.Sleep(20);
        }
    }
}
```
Currently, I think the real fix is that the console, agents, and NUnitLite need… Existing issue: nunit/nunit-console#336
Wait, that's not how this works.
I have no idea why, but the thread simply disappears in spite of the… (snippet: nunit/src/NUnitFramework/framework/Internal/Reflect.cs, lines 261 to 275 at a56e685)
If you continue again, the console resumes running and the thread is simply gone. If…
Aha, this is why:
We have to choose between partial trust and HandleProcessCorruptedStateExceptionsAttribute.

@rprouse None of the HandleProcessCorruptedStateExceptionsAttributes we've added have had any effect because of #1775. Should we pull them out (or at least mark them as ineffective), or should we remove partial trust? If we choose to keep partial trust, we are at the mercy of the console/agent .config files to be able to prevent the thread from disappearing.

Because the thread disappears, the only way I can think of to prevent a hang is to implement a watchdog thread that notices when this happens and can then mark the work item as errored due to the thread being killed by an uncatchable exception. But even this is an odd scenario, because normally the console would be killed at the same time as the thread; we have this line in the console's .config to keep the process alive in spite of the thread dying:
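A minimal sketch of that watchdog idea (all names are invented; NUnit's real WorkItem and dispatcher types are not shown): a periodic sweep checks each registered worker, and if a worker's thread has died while its current work item never completed, the item is marked as errored so the dispatcher can drain instead of hanging.

```csharp
using System.Collections.Concurrent;
using System.Threading;

class WorkerWatchdog
{
    // Stand-in for a worker thread paired with its in-flight work item.
    public sealed class Entry
    {
        public Thread Thread;
        public volatile bool ItemComplete;
        public volatile bool MarkedErrored;
    }

    private readonly ConcurrentBag<Entry> _workers = new ConcurrentBag<Entry>();

    public void Register(Entry entry) => _workers.Add(entry);

    // One sweep: returns how many vanished workers had their items marked errored.
    public int Sweep()
    {
        int marked = 0;
        foreach (var e in _workers)
        {
            // A dead thread with an unfinished item means the worker was
            // killed by an uncatchable exception; fail the item so the
            // queues can shut down cleanly.
            if (!e.Thread.IsAlive && !e.ItemComplete && !e.MarkedErrored)
            {
                e.MarkedErrored = true; // real code would mark the WorkItem errored here
                marked++;
            }
        }
        return marked;
    }
}
```

In the real framework the sweep would run on its own background thread on a timer; it is shown as a plain method here to keep the sketch small.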
I should tag @nunit/framework-team and also @ggeurts (who asked for and added partial trust) and @endrih (who asked for and added handling of exceptions like AccessViolationException). We can't have it both ways. I think it's safe to say we'll always have corrupted state exceptions, but we won't always have partial trust since it's not recommended by Microsoft. (Recent CPU exploits show just how weak the entire paradigm is, too!) |
Great digging here Joseph - I think I agree with your conclusions. This issue would be another duplicate, right? https://github.com/nunit/nunit-console/issues/644 cc @samcook |
Yep looks like it's probably the same thing I came across too. |
The team question is now moved to #3301 since this thread is very long and the question was kind of buried. There's nothing more to do in this thread until #3301 is decided. If the decision goes one way, this issue will be fixed along with that one. If it goes the other way, we'll have one final option to explore within this issue: a watchdog thread to prevent the hang when a test worker thread vanishes without marking complete the work item it owns. (A watchdog thread will only help when using the console runner, though, because the default behavior for an unhandled exception is to kill the process, not just a thread. The console runner opts into keeping the process alive, but this is unusual, and e.g. ReSharper and VSTest's host processes will still be killed.)
To everyone watching this thread who is hitting this issue while using the console runner: use the… It hasn't been confirmed, but it seems likely that we won't be able to fix this on the NUnit Framework side until v4.
This was fixed in .NET Framework in February's update.
I'm finally tackling a longstanding hang issue with our CI builds at work. What makes me sure that this is an NUnit bug is that whenever the tests hang and I attach a debugger, there is no user code in any thread's call stack. There are always between 5 and 7 worker threads, blocked on the following call:
The fact that there are always fewer than 9 threads would seem to indicate that NUnit is in the process of shutting down from the normal eight workers plus an STA thread every time this happens. At least two worker threads have shut down in every case I've seen so far.
I have a project that reproduces the problem around 2% of the time. It's still 39 tests. I'm in the process of trying to remove things.