Reproducible hang with no user code when shutting down threads #3295

Closed

jnm2 opened this issue Jun 28, 2019 · 45 comments

jnm2 (Contributor) commented Jun 28, 2019

I'm finally tackling a longstanding hang issue with our CI builds at work. What makes me sure that this is an NUnit bug is that whenever the tests hang and I attach a debugger, there is no user code in any thread's call stack. There are always between 5 and 7 worker threads, blocked on the following call:

mscorlib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken)
nunit.framework.dll!NUnit.Framework.Internal.Execution.WorkItemQueue.Dequeue()
nunit.framework.dll!NUnit.Framework.Internal.Execution.TestWorker.TestWorkerThreadProc()
mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)
mscorlib.dll!System.Threading.ThreadHelper.ThreadStart()
[Native to Managed Transition]

The fact that there are always fewer than nine threads suggests that every time this happens, NUnit is in the process of shutting down from the normal eight workers plus an STA worker. At least two worker threads have already shut down in every case I've seen so far.

I have a project that reproduces the problem around 2% of the time. It still contains 39 tests, and I'm in the process of paring it down.

jnm2 self-assigned this Jun 28, 2019
jnm2 (Contributor, Author) commented Jun 28, 2019

I really hope this isn't environmental. I was making steady progress removing things without losing the repro, but now I can't get it to repro anymore even if I build older commits along the way that used to repro.

jnm2 (Contributor, Author) commented Jun 28, 2019

Okay, it's not that I'm losing the repro, thankfully. It's just getting rarer. I ran the last cycle overnight and I see this morning that it took an hour of back-to-back runs to reproduce.

jnm2 (Contributor, Author) commented Jun 29, 2019

Still removing minor parts of the remaining 39 tests, but it's hard to do much without losing the repro. Verifying whether I still have it takes anywhere from a few seconds to a bit over an hour, and an hour is a long time to wait for any feedback on what I just tried.

jnm2 (Contributor, Author) commented Jun 29, 2019

None of my guesses at how to increase the likelihood of the hang have panned out either, but if @CharliePoole or @oznetmaster or anyone else has a theory about what might make the hang more likely, I'd be excited to try it.

jnm2 (Contributor, Author) commented Jun 29, 2019

Since the hang is in WorkItemQueue.Dequeue, I thought I'd start looking there. The three things I see are:

  1. This TODO comment and the commented-out guard:

    // TODO: Originally, the following Guard statement was used. In theory, no queues should be running
    // when we are doing a restore. It appears, however, that we end the shift, pausing queues, buy that
    // a thread may then sneak in and restart some of them. My tests pass without the guard but I'm still
    // concerned to understand what is happening and why. I'm leaving this commented out so that somebody
    // else can take a look at it later on.
    //Guard.OperationValid(State != WorkItemQueueState.Running, $"Attempted to restore state of {Name} while queue was running.");
    var state = _savedState.Pop();

  2. It will take me a while to understand why the code saves and restores queue state.

  3. There are enough moving parts inside WorkItemQueue (not to mention outside it) that it's a bit over my head to figure out whether there is a thread-safety issue.

CharliePoole (Contributor) commented:

See separate thread on why I think this is ready to be rewritten. 😉

jnm2 (Contributor, Author) commented Jun 29, 2019

Could these lines in the Dequeue loop be getting stale reads on every iteration, since they aren't using Volatile.Read (or volatile fields on .NET 3.5)?

int cachedRemoveId = _removeId;
int cachedAddId = _addId;
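
For context, a minimal sketch of what explicit volatile reads could look like here (an illustration only, assuming _removeId and _addId remain plain int fields; this is not a change NUnit has made):

int cachedRemoveId = Volatile.Read(ref _removeId);
int cachedAddId = Volatile.Read(ref _addId);

On .NET 3.5, where System.Threading.Volatile doesn't exist, declaring the fields volatile would be the rough equivalent.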

jnm2 (Contributor, Author) commented Jun 29, 2019

@CharliePoole I'm not sure I'm the best person to rewrite it, and I saw Rob's comment about starting from scratch being scary, which I tend to agree with. I like your feature-switch suggestion. However, you were talking about rewriting the dispatcher. Would it be possible to rewrite the WorkItemQueue class without rewriting anything else?

CharliePoole (Contributor) commented:

Worth trying. However, I'm at the point of believing it's time to make sure we have adequate tests and start over re-implementing the dispatcher.

CharliePoole (Contributor) commented:

Sure. That has the advantage of being fairly well tested already. You might want to add some more.

IMO there is a lot of value in figuring out ways to do the scary stuff without harming production.

jnm2 (Contributor, Author) commented Jun 29, 2019

More info: I ran with an older commit of my repro with more tests and the repro happened on the second run. This time only one parallel worker shut down but all the rest are stuck waiting as usual. I took an internal trace and have a debugger attached.

Both the parallel queue and the nonparallel STA queue are empty, have the state Running, and have both addId and removeId equal. (Both 47 for the parallel queue and 77 for the STA queue.)

Three of the test workers have a work item named OneTimeTearDown, in spite of the fact that I don't have any one-time teardowns, and the work item state is Ready. The remaining five (including the STA worker) have a work item with a test name and the state Complete.

I can't read the internal trace logs until I kill the process.

jnm2 (Contributor, Author) commented Jun 29, 2019

My bad, wrong log. Anyway, out of four runs, the two that reproduced the hang have logs ending with:

11:40:33.823 Debug [17] Dispatcher: Using Direct strategy for BindingSourceMustHaveMatchingViewModelType OneTimeTearDown
11:40:33.823 Debug [17] Dispatcher: Using Parallel strategy for TestBindingSourceTypes OneTimeTearDown
11:40:33.823 Info  [14] TestWorker: ParallelWorker#6 executing TestBindingSourceTypes OneTimeTearDown
11:40:33.823 Debug [14] Dispatcher: Using Direct strategy for Views OneTimeTearDown
11:58:44.773 Debug [17] Dispatcher: Using Direct strategy for BindingSourceMustHaveMatchingViewModelType OneTimeTearDown
11:58:44.773 Debug [17] Dispatcher: Using Parallel strategy for TestBindingSourceTypes OneTimeTearDown
11:58:44.773 Info  [12] TestWorker: ParallelWorker#4 executing TestBindingSourceTypes OneTimeTearDown
11:58:44.773 Debug [12] Dispatcher: Using Direct strategy for Views OneTimeTearDown

jnm2 (Contributor, Author) commented Jun 29, 2019

For comparison, here is what the non-hanging traces look like at the exact same point:

11:58:35.847 Debug [17] Dispatcher: Using Direct strategy for BindingSourceMustHaveMatchingViewModelType OneTimeTearDown
11:58:35.847 Debug [17] Dispatcher: Using Parallel strategy for TestBindingSourceTypes OneTimeTearDown
11:58:35.847 Info  [14] TestWorker: ParallelWorker#6 executing TestBindingSourceTypes OneTimeTearDown
11:58:35.847 Debug [14] Dispatcher: Using Direct strategy for Views OneTimeTearDown
11:58:35.847 Debug [14] Dispatcher: Using Direct strategy for Tests OneTimeTearDown
11:58:35.847 Debug [14] Dispatcher: Using Direct strategy for [Redacted] OneTimeTearDown
11:58:35.847 Debug [14] Dispatcher: Using Direct strategy for Techsola OneTimeTearDown
11:58:35.847 Debug [14] Dispatcher: Using Direct strategy for [Redacted].Tests.dll OneTimeTearDown
11:58:35.851 Info  [14] WorkShift: Parallel shift ending
11:58:35.851 Debug [14] WorkItemQueue: ParallelQueue.0 pausing
11:58:35.851 Debug [14] WorkItemQueue: ParallelSTAQueue.0 pausing
11:58:35.852 Info  [14] WorkItemQueue: ParallelQueue.0 stopping - 48 WorkItems processed
11:58:35.852 Info  [14] WorkItemQueue: ParallelSTAQueue.0 stopping - 77 WorkItems processed
11:58:35.852 Info  [11] TestWorker: ParallelWorker#3 stopping - 12 WorkItems processed.
11:58:35.852 Info  [10] TestWorker: ParallelWorker#2 stopping - 12 WorkItems processed.
11:58:35.852 Info  [14] WorkItemQueue: NonParallelQueue.0 stopping - 0 WorkItems processed
11:58:35.852 Info  [ 9] TestWorker: ParallelWorker#1 stopping - 4 WorkItems processed.
11:58:35.852 Info  [16] TestWorker: ParallelWorker#8 stopping - 3 WorkItems processed.
11:58:35.852 Info  [13] TestWorker: ParallelWorker#5 stopping - 2 WorkItems processed.
11:58:35.852 Info  [14] WorkItemQueue: NonParallelSTAQueue.0 stopping - 0 WorkItems processed
11:58:35.853 Info  [15] TestWorker: ParallelWorker#7 stopping - 3 WorkItems processed.
11:58:35.853 Info  [12] TestWorker: ParallelWorker#4 stopping - 3 WorkItems processed.
11:58:35.853 Info  [14] TestWorker: ParallelWorker#6 stopping - 9 WorkItems processed.
11:58:35.852 Info  [17] TestWorker: ParallelSTAWorker stopping - 77 WorkItems processed.

jnm2 (Contributor, Author) commented Jun 29, 2019

Somehow, three OneTimeTearDown work items were dequeued by three test workers, and each worker is sitting with its work item in _currentWorkItem. Without completing _currentWorkItem (each work item's state is still Ready), all three test workers called Dequeue again. The queues are all empty but still running, so Dequeue never returns.

CharliePoole (Contributor) commented:

I'm not seeing that. There's only one TestWorker active in the log snippet that finishes. Thread 17, which is owned by the parallel STA worker, runs two one-time teardowns; then thread 14 runs the one-time teardown for the last remaining fixture as well as for each enclosing namespace and the dll itself. Those one-time teardowns probably aren't doing anything except flagging completion.

Then thread 14 shuts down the shift, causing all the queues and workers to stop. Thread 17 is the last to stop. In your hung situation, I think something is happening in the one-time teardown for Views. Is there a SetUpFixture?

jnm2 (Contributor, Author) commented Jun 29, 2019

I just noticed while poking around another hang that the OneTimeTearDown work items (the ones that are Ready and never complete) all have a CurrentApartment of STA and a TargetApartment of Unknown, even though they are the _currentWorkItem of three parallel workers.

jnm2 (Contributor, Author) commented Jun 29, 2019

@CharliePoole No SetUpFixtures, no one-time setups or teardowns. You've described what happens when there is no hang, but what about when there is a hang? When I attach a debugger, I see three test workers in a call to WorkItemQueue.Dequeue that can never return, each with a different OneTimeTearDown work item in the Ready state. Doesn't that mean that each of the three work items was put in a worker's _currentWorkItem but never completed, and that somehow all three workers went back and called Dequeue again in spite of that? How else can I explain what I'm seeing in the debugger?

CharliePoole (Contributor) commented:

I'll have to look at this when I can be in the code... a little later this evening.

jnm2 (Contributor, Author) commented Jun 29, 2019

Ah, in this repro commit I do have an ITestAction attribute applied to the assembly. I did remove the attribute from later commits, which still reproed, but not as often.

jnm2 (Contributor, Author) commented Jun 29, 2019

If I use --inprocess, I get to see a TargetInvocationException. Not only that, the stack trace printout is interrupted by another exception message, this time an AccessViolationException thrown while parsing the PDB to get line information for the stack trace. This is wild.

Right after that, the final test ran and the repro happened. Only 7 of the 9 worker threads are still running.

Maybe the repro is to throw AccessViolationException in a test method?

jnm2 (Contributor, Author) commented Jun 29, 2019

The self-executing test run is up to thousands of invocations of new FrameworkController(), LoadTests, and RunTests, and no repro. No exceptions either.

jnm2 (Contributor, Author) commented Jun 29, 2019

Looks like a bug in .NET Framework, https://github.com/dotnet/coreclr/issues/19698, "System.AccessViolationException while formatting stacktrace"

(My machine has HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full\Release = 528040, meaning .NET Framework 4.8 on Windows 10 1903.)

jnm2 (Contributor, Author) commented Jun 29, 2019

This intermittent access violation can be reproduced simply by putting new StackTrace(true); as the body of a few test methods. The access violation and the hang always seem to go together.

This is insanely timing-sensitive, though.

jnm2 (Contributor, Author) commented Jun 29, 2019

Heh, deleting files was working against me: the more source files are embedded in the test assembly's PDB, the slower System.Diagnostics.StackTraceSymbols.GetSourceLineInfoWithoutCasAssert is, and the more likely the race condition is to trigger the access violation bug.

All the tests can be commented out, but the more .cs files in the test project, the better the chance of a repro.

jnm2 (Contributor, Author) commented Jun 29, 2019

I have a tool that shows the repro rate over a hundred runs (it runs four NUnit Console processes at a time), so I can see whether a change I make to the project makes the repro more or less likely. The rate is holding at around 10% right now. I want to pull a lot of proprietary code out of the repro that I haven't been able to touch without losing the repro.
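
For illustration, roughly what such a harness might look like (entirely hypothetical; this is not the actual tool, and the runner path, assembly name, and timeout are placeholders):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical repro-rate harness: run the console runner 100 times, four
// processes at a time, and count any run that exceeds a timeout as a hang.
internal static class ReproRateHarness
{
    private static void Main()
    {
        const int totalRuns = 100;
        var hangs = 0;

        Parallel.ForEach(
            Enumerable.Range(0, totalRuns),
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            _ =>
            {
                using (var process = Process.Start("nunit3-console.exe", "HangRepro.Tests.dll"))
                {
                    if (!process.WaitForExit(60_000))
                    {
                        // The run hung: count it and kill the stuck process.
                        Interlocked.Increment(ref hangs);
                        process.Kill();
                    }
                }
            });

        Console.WriteLine($"Hang rate: {100.0 * hangs / totalRuns:F0}%");
    }
}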

jnm2 (Contributor, Author) commented Jun 29, 2019

According to this tool, a 5 ms timing difference in the tests preceding the new StackTrace(true); tests makes the difference between a 20% repro rate and a 2% rate.

CharliePoole (Contributor) commented:

Sorry, I was going to take a look but other stuff happened. Going to sleep now.

jnm2 (Contributor, Author) commented Jun 29, 2019

I appreciate it! I think the cause is a .NET Framework bug that produces a poisoned exception: any time its stack trace is printed or retrieved, an AccessViolationException is thrown. This kills one of the test worker threads and writes directly to stderr.

Once I get a sharable repro, I'll start looking for ways to harden NUnit against this .NET Framework bug.

jnm2 (Contributor, Author) commented Jun 29, 2019

Pushed a self-contained (if not yet completely minimal) repro to https://github.com/jnm2/NUnitHangRepro.

🎉 It repros on another machine (shows 13% repro rate)! Also .NET Framework 4.8, Windows 10 1903.

jnm2 (Contributor, Author) commented Jun 30, 2019

Removing <EmbedAllSources>true</EmbedAllSources> is a workaround if the only PDBs in the output folder are yours.

Another workaround might be a one-time setup in a SetUpFixture that purposely causes an exception in each assembly that has a PDB. It would run before all tests, so there would be no chance of a race condition. The problem with this workaround is that it requires a lot of foresight when you have a lot of assemblies to test.
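
For illustration, a minimal sketch of that idea for a single assembly (the fixture and method names are made up; covering every assembly that has a PDB would need a similar throw originating in each one):

using System;
using System.Diagnostics;
using NUnit.Framework;

// Hypothetical warm-up fixture: resolve stack trace line info once, on a single
// thread, before any parallel tests run, so the PDB reader isn't first touched
// by several threads racing each other.
[SetUpFixture]
public class WarmUpStackTraceSymbols
{
    [OneTimeSetUp]
    public void ResolveSymbolsBeforeParallelTestsRun()
    {
        try
        {
            throw new InvalidOperationException("Warm-up exception");
        }
        catch (InvalidOperationException ex)
        {
            // Asking for file/line info forces the portable PDB to be read here.
            new StackTrace(ex, true).ToString();
        }
    }
}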

jnm2 (Contributor, Author) commented Jun 30, 2019

Removing EmbedAllSources isn't good enough in general. The hang is still happening. Maybe removing that just shifts around the timing, or maybe NUnit.TestAdapter.pdb is messing with it.

I just recalled something I saw last year: there was a performance issue with this same thing, the portable PDB format, and I remembered there was a switch to disable the file and line info. That enabled me to dig this up. We could try this in the tests' app.config files:

<configuration>
  <runtime>
    <AppContextSwitchOverrides value="Switch.System.Diagnostics.IgnorePortablePDBsInStackTraces=true" />
  </runtime>
</configuration>

https://github.com/Microsoft/dotnet/blob/master/Documentation/compatibility/Stack-traces-obtained-when-using-portable-PDBs-now-include-source-file-and-line-information-if-requested.md#stack-traces-obtained-when-using-portable-pdbs-now-include-source-file-and-line-information-if-requested

Using the Windows PDB format might also work, though since the NUnit VSTest adapter shows up near the entry point in an exception stack trace, its portable PDB might still trigger the problem.
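
For illustration, switching a .NET Framework test project to Windows PDBs would be something like this in the csproj (an untested sketch, not something verified in this thread):

<!-- Emit classic Windows PDBs instead of portable PDBs for this assembly. -->
<PropertyGroup>
  <DebugType>full</DebugType>
</PropertyGroup>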

jnm2 (Contributor, Author) commented Jun 30, 2019

The app.config switch didn't seem to have an effect. Last workaround I can think of: nuke the PDBs in the tests' build output.

jnm2 (Contributor, Author) commented Jul 1, 2019

I made a mistake while verifying the repro. Nuking the PDBs out of all test assembly directories before running tests seems to be a good workaround for now. I will try the app.config switch again too.
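
For illustration, one way to automate deleting those PDBs (a hypothetical MSBuild target, not something from this thread):

<!-- Delete all PDBs from the test output after building, before tests run. -->
<Target Name="DeletePdbsFromTestOutput" AfterTargets="Build">
  <ItemGroup>
    <PdbsToDelete Include="$(OutputPath)**\*.pdb" />
  </ItemGroup>
  <Delete Files="@(PdbsToDelete)" />
</Target>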

jnm2 (Contributor, Author) commented Jul 1, 2019

I just demonstrated again that the app.config workaround doesn't work. It's probably stymied by https://github.com/nunit/nunit-console/pull/443/files (fyi: @nunit/engine-team). The IgnorePortablePDBsInStackTraces=false we added to the console and agent configs is likely overriding the IgnorePortablePDBsInStackTraces=true I'm putting in each test project, in spite of the fact that it's a separate AppDomain. (?)

jnm2 (Contributor, Author) commented Jul 2, 2019

Current repro. This averages a 23% repro rate over 1,000 runs. To maximize the window of time in which the race condition is deadly, the project needs <EmbedAllSources>true</EmbedAllSources> and a very large source file like this.

using System.Diagnostics;
using System.Threading;
using NUnit.Framework;

[assembly: Parallelizable(ParallelScope.Children)]

// 6 or greater makes it more likely for A(35) and A(55) to start at the same time
[assembly: LevelOfParallelism(6)]

namespace HangRepro
{
    public static class ReproTests
    {
        [Test]
        public static void A([Values(35, 55)] int delay)
        {
            Thread.Sleep(delay);

            // Boom
            new StackTrace(true);
            // or e.g. any Shouldly method, since the first thing it does is grab a stack trace.
        }

        [Test]
        public static void B([Range(1, 20)] int _)
        {
            Thread.Sleep(20);
        }
    }
}

jnm2 (Contributor, Author) commented Jul 2, 2019

Currently, I think the real fix is that the console, agents, and NUnitLite need <legacyCorruptedStateExceptionsPolicy enabled="true"/> in their app.config; otherwise the NUnit framework simply can't catch AccessViolationException, even though it already has [HandleProcessCorruptedStateExceptions] where it needs it.
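
For reference, the element goes under <runtime> in the runner's app.config (a sketch of just the relevant fragment):

<configuration>
  <runtime>
    <legacyCorruptedStateExceptionsPolicy enabled="true" />
  </runtime>
</configuration>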

Existing issue: nunit/nunit-console#336

jnm2 (Contributor, Author) commented Jul 2, 2019

Wait, that's not how this works. [HandleProcessCorruptedStateExceptions] works regardless of the .config file. <legacyCorruptedStateExceptionsPolicy> in the console/agent .config does fix the problem, though, which is a clue.

jnm2 (Contributor, Author) commented Jul 2, 2019

I have no idea why, but the thread simply disappears in spite of the [HandleProcessCorruptedStateExceptions]. The debugger shows the AccessViolationException normally on line 270. When you continue, the debugger shows the exception again, but this time as unhandled, in spite of the attribute!

#if !(NET35 || NETSTANDARD1_4)
[HandleProcessCorruptedStateExceptions] //put here to handle C++ exceptions.
#endif
public static object InvokeMethod(MethodInfo method, object fixture, params object[] args)
{
    if (method != null)
    {
        try
        {
            return method.Invoke(fixture, args);
        }
        catch (TargetInvocationException e)
        {
            throw new NUnitException("Rethrown", e.InnerException);
        }

If you continue again, the console resumes running and the thread is simply gone. If [HandleProcessCorruptedStateExceptions] has no effect, there is nothing we can do to fix this in the framework. It has to be fixed in the console's .config.

jnm2 (Contributor, Author) commented Jul 2, 2019

Aha, this is why:

The HandleProcessCorruptedStateExceptionsAttribute attribute is ignored when it is encountered in partially trusted or transparent code, because a trusted host should not allow an untrusted add-in to catch and ignore these serious exceptions.

We have to choose between partial trust and HandleProcessCorruptedStateExceptionsAttribute.

@rprouse None of the HandleProcessCorruptedStateExceptionsAttributes we've added have had any effect because of #1775. Should we pull them out (or at least mark them as ineffective) or should we remove partial trust?

If we choose to keep partial trust, we are at the mercy of the console/agent .config files to prevent the thread from disappearing. Because the thread disappears, the only way I can think of to prevent a hang is to implement a watchdog thread that notices when this happens and marks the work item as errored because its thread was killed by an uncatchable exception. Even this is an odd scenario: normally the console process would be killed along with the thread, but we have this line in the console's .config to keep the process alive in spite of the thread dying:

https://github.com/nunit/nunit-console/blame/4ae12183927747e2a35859ad520790f84ebaf5f9/src/NUnitConsole/nunit3-console/app.config#L17
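
For illustration, a rough sketch of what such a watchdog could look like (purely hypothetical; the worker and work-item shapes below are stand-ins, not NUnit's actual internals):

using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical watchdog: poll the worker threads and fail the current work item
// of any thread that has died without completing it, so the dispatcher can
// finish instead of waiting forever.
public sealed class WorkerWatchdog
{
    private readonly IReadOnlyList<(Thread Thread, Action FailCurrentWorkItem)> _workers;
    private readonly HashSet<Thread> _alreadyFailed = new HashSet<Thread>();

    public WorkerWatchdog(IReadOnlyList<(Thread Thread, Action FailCurrentWorkItem)> workers)
    {
        _workers = workers;
    }

    public void Run(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            foreach (var worker in _workers)
            {
                // A dead worker thread can never complete its current work item,
                // so report the item as errored rather than letting the queues
                // wait on it forever.
                if (!worker.Thread.IsAlive && _alreadyFailed.Add(worker.Thread))
                    worker.FailCurrentWorkItem();
            }

            Thread.Sleep(1000);
        }
    }
}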

jnm2 (Contributor, Author) commented Jul 2, 2019

I should tag @nunit/framework-team and also @ggeurts (who asked for and added partial trust) and @endrih (who asked for and added handling of exceptions like AccessViolationException). We can't have it both ways.

I think it's safe to say we'll always have corrupted-state exceptions, but we won't always have partial trust, since it's not recommended by Microsoft. (Recent CPU exploits show just how weak the entire paradigm is, too!) How do you feel about removing support for partial trust, though, now that we have supported it since NUnit Framework v3.5?

ChrisMaddock (Member) commented:

Great digging here, Joseph. I think I agree with your conclusions.

This issue would be another duplicate, right? https://github.com/nunit/nunit-console/issues/644 cc @samcook

samcook commented Jul 2, 2019

Yep, looks like it's probably the same thing I came across too.

jnm2 (Contributor, Author) commented Jul 3, 2019

The team question has now moved to #3301, since this thread is very long and the question was getting buried. There's nothing more to do in this thread until #3301 is decided. If the decision goes one way, this issue will be fixed along with that one. If it goes the other way, we'll have one final option to explore within this issue: a watchdog thread to prevent the hang when a test worker thread vanishes without marking the work item it owns as complete.

(A watchdog thread will only help when using the console runner, though, because the default behavior for an unhandled exception is to kill the process, not just a thread. The console runner opts into keeping the process alive, but this is unusual; for example, ReSharper's and VSTest's host processes will still be killed.)

jnm2 (Contributor, Author) commented Jul 16, 2019

To everyone watching this thread who is hitting this issue while using the console runner: use build 3.11.0-dev-04532 or later of https://www.myget.org/feed/nunit/package/nuget/NUnit.ConsoleRunner. That's one way to fix this issue. (Fix: nunit/nunit-console#660)

It hasn't been confirmed, but it seems likely that we won't be able to fix this on the NUnit Framework side until v4.

jnm2 removed this from the 3.13 milestone May 9, 2020
jnm2 (Contributor, Author) commented May 9, 2020

This was fixed in .NET Framework in February's update:
https://devblogs.microsoft.com/dotnet/net-framework-february-2020-security-and-quality-rollup/

CLR

  • There is a race condition in the portable PDB metadata provider cache that leaked providers and caused crashes in the diagnostic StackTrace API. To fix the race, detect the cause where the provider wasn’t being disposed and dispose it.

jnm2 closed this as completed May 9, 2020
rprouse added this to the Closed Without Action milestone May 11, 2020