Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any concerns about not locking reads/writes in ICache.GetOrAddAsync()? #720

Closed
jakestrafford opened this issue Jun 13, 2023 · 2 comments
Closed

Comments

@jakestrafford
Copy link

Which version of Duende IdentityServer are you using?
6.2.3
Which version of .NET are you using?
.NET 7.0

Describe the bug

A seemingly “hanging” database query takes down our IdentityServer instance when all subsequent requests for Resources result in Failed to obtain cache lock for: 'Duende.IdentityServer.Services.DefaultCache1[Duende.IdentityServer.Models.Resources]'`

To Reproduce

We don't know yet. It occurred twice in our dev environments during automated testing, two months apart. We appear to have had just one IdentityServer node running during the error.

Expected behavior / Question

We don’t want our IDP to be down if this occurs in production. Are there any suspected risks to disabling the locking logic for reads and writes to the cache in DefaultCache<T>-like implementations of ICache<T>?

In #659, Joe DeCock mentions:

…cached items are updated very infrequently, so locking while they are updated seems like a sensible defensive strategy at face value…

By “defensive strategy”, are you just referring to how you don’t know how thread-safe the customer’s IMemoryCache implementation is, or how expensive the Func<Task<T>> get parameter is?

We’re using stores from Duende.IdentityServer.EntityFramework, and Microsoft’s default IMemoryCache implementation. We’re planning on bypassing the locking logic if the error occurs in the future, buying us time to debug. We have limited time to test our solution, I'm asking here in hopes of discovering potential issues before we run into them.

Log output/exception with stacktrace

Failed to obtain cache lock for: 'Duende.IdentityServer.Services.DefaultCache`1[Duende.IdentityServer.Models.Resources]'

   at Duende.IdentityServer.Services.DefaultCache`1.GetOrAddAsync(String key, TimeSpan duration, Func`1 get) in /_/src/IdentityServer/Services/Default/DefaultCache.cs:line 149
   at Duende.IdentityServer.Stores.CachingResourceStore`1.FindItemsAsync[TItem](IEnumerable`1 names, ICache`1 cache, Func`2 getResourcesFunc, Func`2 getFromResourcesFunc, Func`2 getNameFunc, String allCachePrefix) in /_/src/IdentityServer/Stores/Caching/CachingResourceStore.cs:line 255
   at Duende.IdentityServer.Stores.CachingResourceStore`1.FindIdentityResourcesByScopeNameAsync(IEnumerable`1 scopeNames) in /_/src/IdentityServer/Stores/Caching/CachingResourceStore.cs:line 187
   at Duende.IdentityServer.Stores.IResourceStoreExtensions.FindResourcesByScopeAsync(IResourceStore store, IEnumerable`1 scopeNames) in /_/src/IdentityServer/Extensions/IResourceStoreExtensions.cs:line 38
   at Duende.IdentityServer.Stores.IResourceStoreExtensions.FindEnabledResourcesByScopeAsync(IResourceStore store, IEnumerable`1
...
   at Duende.IdentityServer.Endpoints.AuthorizeEndpointBase.ProcessAuthorizeRequestAsync(NameValueCollection parameters, ClaimsPrincipal user, Boolean checkConsentResponse) in /_/src/IdentityServer/Endpoints/AuthorizeEndpointBase.cs:line 153
...

Additional context

Some evidence that the lock is being held onto for hours:
image

The last “Executing… SELECT * From IdentityResources…” is logged at 9:04:14. We found no evidence of this query ever returning or failing. Our first “Failed to obtain cache lock” is logged at 9:05:18.

Our PostgreSQL database that stores our configuration remained up. Token and session cleanup continued to execute fine all morning. The IdentityResources table was available for queries from my local machine.

Other related posts:
Consider relaxing the cache lock behavior
Failed to obtain cache lock for Exception

*Side note, would you have expected CancellationTokenProvider.CancellationToken to rescue EntityFramework.Storage.ResourceStore’s GetAllResourcesAsync() ? We're still looking into our database timeout settings, perhaps the responsibility lies there.

@josephdecock
Copy link
Member

The intention in the design of the lock is to prevent duplicate calls to the get parameter.

If you remove the locking around that call, the risk is that if the problem occurs again, you'll now have multiple concurrent attempts to perform the get, which might exacerbate the problem. On the other hand, customizing the lock to be tuned for your environment might give you the level of control you need to address this problem.

The lock does have a configurable time out. Perhaps shortening that time out will help. You can also change the database connection that the get function ultimately relies on to change its timeout (and perhaps retry behavior?)

@jakestrafford
Copy link
Author

Thank you, that seems like sensible advice. We'll focus on making our queries more resilient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants