Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Threading.Tests.PeriodicTimerTests.PeriodicTimer_ActiveOperations_TimerRooted failing in CI #59542

Closed
runfoapp bot opened this issue Sep 23, 2021 · 14 comments · Fixed by #68805
Closed
Assignees
Labels
area-System.Threading blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@runfoapp
Copy link

runfoapp bot commented Sep 23, 2021

Runfo Tracking Issue: System.Threading.Tests.PeriodicTimerTests.PeriodicTimer_ActiveOperations_TimerRooted failing in CI

Build Definition Kind Run Name Console Core Dump Test Results Run Client
1742775 runtime Rolling net7.0-Linux-Release-arm64-CoreCLR_checked-(Alpine.314.Arm64.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-arm64v8-20210910135810-8a6f4f3 console.log runclient.py
1741531 runtime PR 68629 net7.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.314.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19 console.log runclient.py
1739014 runtime PR 68563 net7.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.314.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19 console.log runclient.py
1736863 runtime PR 67771 net7.0-Linux-Debug-arm-CoreCLR_checked-(Alpine.314.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-arm32v7-20210910135806-8a6f4f3 console.log runclient.py
1732574 runtime PR 68381 net7.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.314.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19 console.log runclient.py
1731686 runtime PR 68177 net7.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open console.log runclient.py
1723729 runtime PR 68126 net7.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.314.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-amd64-20210910135833-1848e19 console.log runclient.py
1721714 runtime PR 68109 net7.0-Linux-Debug-arm-CoreCLR_checked-(Alpine.314.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.14-helix-arm32v7-20210910135806-8a6f4f3 console.log runclient.py
1700818 runtime PR 67367 net7.0-Linux-Debug-x64-CoreCLR_release-(Debian.10.Amd64.Open)Ubuntu.1804.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-10-helix-amd64-20210304164434-56c6673 console.log runclient.py

Build Result Summary

Day Hit Count Week Hit Count Month Hit Count
0 2 9
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 23, 2021
@jkoritzinsky jkoritzinsky added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' area-System.Threading labels Sep 23, 2021
@ghost
Copy link

ghost commented Sep 23, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

Runfo Creating Tracking Issue (data being generated)

Author: runfoapp[bot]
Assignees: -
Labels:

area-System.Threading, blocking-clean-ci, untriaged

Milestone: -

@mangod9
Copy link
Member

mangod9 commented Sep 24, 2021

@kouvel

@stephentoub
Copy link
Member

This is very likely another case of us just not being able to test reliably that things are becoming unrooted and available for collection, e.g.
#57751

@krwq
Copy link
Member

krwq commented Oct 1, 2021

@stephentoub do you suggest to disable the test similarly as was done in #57751?

@stephentoub
Copy link
Member

Yeah, I think that makes sense, until we can figure out how to better stabilize these.

@danmoseley
Copy link
Member

Do we have any idea how? Can GC folks indicate whether it’s even possible?

@krwq
Copy link
Member

krwq commented Oct 26, 2021

This has failed again on the first attempt here: https://dev.azure.com/dnceng/public/_build/results?buildId=1440422&view=results - I'm not sure why runfo hasn't updated
EDIT: runfo updated by now

@danmoseley
Copy link
Member

danmoseley commented Jan 19, 2022

@mangod9 can GC folks advise on whether we should expect these patterns to be deterministic?

In this case, it wants to verify a weak reference was collected

        private static void WaitForTimerToBeCollected(WeakReference<PeriodicTimer> timer, bool expected)
        {
            Assert.Equal(expected, SpinWait.SpinUntil(() =>
            {
                GC.Collect();
                return !timer.TryGetTarget(out _);
            }, TimeSpan.FromSeconds(expected ? 5 : 0.5)));
        }

In the issue @stephentoub linked above, it's verifying a finalization happened, with:

            for (int i = 0; i < 10 && !Volatile.Read(ref finalized); i++)
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
                await Task.Yield();
            }

Patterns like this happen in a number of places in our tests, all assuming that GC.Collect() and GC.WaitForPendingFinalizers() are deterministic if we loop them "a few times" - it would be good to know what we can rely on.

@mangod9
Copy link
Member

mangod9 commented Jan 20, 2022

In general, depending on finalizers to run in a deterministic fashion is probably not a reliable approach. For small tests it should usually be fine, however. Are the intermittent failures mostly timeouts?

@cshung @Maoni0, have you seen reliable mechanisms to achieve what is done above ^^^

@danmoseley
Copy link
Member

@cshung can you advise on q above?

@Maoni0
Copy link
Member

Maoni0 commented Feb 22, 2022

if you are calling GC.Collect(), GC is not in charge of the lifetime. please see this comment.

@danmoseley
Copy link
Member

OK, we need to audit all our tests that do GC.Collect(), and ensure that (or refactor such that):

if you are doing some allocation and want to guarantee the items allocated are not kept alive by hidden JIT references, you must mark the allocating method with [MethodImpl(MethodImplOptions.NoInlining)] and then actually return from the method.

@danmoseley
Copy link
Member

@Danyy427 another one perhaps

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue May 3, 2022
There are two problems with this test
1. `WaitForNextTickAsync` may return a synchronously completed task, in
   which case it does not root the timer, causing our first
   `WaitForTimerToBeCollected` to fail because the timer was collected.
   This problem is easily reproduced by adding a small sleep after
   constructing the `PeriodicTimer` in `Create`, and I believe it is the
   cause of dotnet#59542.
2. There is no guarantee that the timer is not still rooted after the
   wait finishes because the returned `ValueTask<bool>` may be keeping
   it alive. This is unlikely since Roslyn should not extend the
   lifetime of the `ValueTask<bool>` like this across the await, but I
   have introduced another method just to be safe.

Fix dotnet#59542
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 3, 2022
jakobbotsch added a commit that referenced this issue May 3, 2022
There are two problems with this test
1. `WaitForNextTickAsync` may return a synchronously completed task, in
   which case it does not root the timer, causing our first
   `WaitForTimerToBeCollected` to fail because the timer was collected.
   This problem is easily reproduced by adding a small sleep after
   constructing the `PeriodicTimer` in `Create`, and I believe it is the
   cause of #59542.
2. There is no guarantee that the timer is not still rooted after the
   wait finishes because the returned `ValueTask<bool>` may be keeping
   it alive (although, it seems unlikely Roslyn will extend the lifetime across the await like this).
   Fixed by wrapping in another NoInlining method.

Fix #59542
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 3, 2022
@jakobbotsch jakobbotsch self-assigned this May 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Threading blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants