
MemoryLockProvider can cause deadlocks if using nested locks #1226

Closed
mbosecke opened this issue Mar 7, 2024 · 3 comments

mbosecke commented Mar 7, 2024

Because MemoryLockProvider uses striped locking, it's possible for two unrelated keys to collide and block each other. I understand that this was likely done to reduce memory usage, but I'm going to make two arguments for removing the striping and making the lock pool unbounded in size:

  1. For this project, I think it's fair to say that performance is one of the top priorities for users, both latency and throughput. Striped locking inherently puts an upper bound on throughput (the stripe count is currently hard-coded to 1024) and causes random, unpredictable latency when there's contention between unrelated keys; it's also very hard to find the root cause when that happens. I'm working with machines that have hundreds of gigabytes of memory and hundreds of CPU cores, yet this lock provider has put measurable constraints on performance. I'd say that if a user opts in to using a "MemoryLockProvider", it's fair for them to assume that it comes with a small memory cost, and I think the cost of a single ReentrantLock per thread is more than acceptable.

  2. The provider will cause deadlocks if it's used for nested locks, with no way to defend against it. I noticed this while implementing some changes in GeoServer (see here). Here's an example where none of the locks share keys and they are acquired/released in a deterministic order, yet a deadlock will occur if the striping collisions are just unfortunate enough (a sketch of the striping mechanism follows the example):

// Thread 1:
Lock a = getLock("a");
try {
    Lock b = getLock("b"); // BLOCKED BY LOCK C because of a striping collision
    try {
        // ... do work
    } finally {
        b.release();
    }
} finally {
    a.release();
}

// Thread 2:
Lock c = getLock("c");
try {
    Lock d = getLock("d"); // BLOCKED BY LOCK A because of a striping collision
    try {
        // ... do work
    } finally {
        d.release();
    }
} finally {
    c.release();
}
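
To make the collision mechanism concrete, here is a minimal sketch of striped locking in general (an illustration of the concept, not the actual MemoryLockProvider source; the hash-modulo mapping is my assumption about how striping is typically done): every key is reduced to an index into a fixed array of locks, so two unrelated keys can end up sharing the same ReentrantLock.

// Illustration only: a generic striped-lock structure, not the real MemoryLockProvider.
import java.util.concurrent.locks.ReentrantLock;

class StripedLocks {
    // fixed number of stripes; every key maps onto one of these locks
    private final ReentrantLock[] stripes = new ReentrantLock[1024];

    StripedLocks() {
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    ReentrantLock getLock(String key) {
        // unrelated keys (e.g. "b" and "c", or "d" and "a") can hash to the same
        // index, which is exactly the collision that deadlocks the two threads above
        int index = Math.abs(key.hashCode() % stripes.length);
        return stripes[index];
    }
}
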
aaime (Member) commented Mar 8, 2024

Performance is high on the list, for sure, but not at the cost of stability, as these projects typically run for months at a time. All production setups my company manages have request limits in place, so that no single request can use more than X memory and no more than Y requests can run in parallel (GeoServer service limits, along with the control-flow plugin installed), which lets us match the load to the memory given to the JVM.

An OOM is not acceptable, and various measures are taken to prevent it from happening.

mbosecke (Author) commented Mar 8, 2024

Fair enough! First of all, let me preface my response with a huge show of gratitude for this project and all the contributions you've made towards it, as well as related projects. I've been using this (and GeoServer) for probably a decade now, and I admire how well you've led these projects, so I fully trust your judgement. And sorry for the wall of text:

I agree that bounding resources is important, but I think it makes more sense to tackle that problem by limiting thread pool sizes instead of limiting lock counts. The former is easy to reason about, whereas the latter is complex and requires an understanding of how many locks any given thread will hold, as well as the level of contention.

I used my IDE to do some rough calculations, and according to it, the retained size of 1024 ReentrantLock instances is 53 kB, meaning each lock is roughly 53 bytes. One lock really isn't much more than a small String.

From the perspective of an end user, I think it makes sense to give them the ability to limit their number of threads as the solution to bounding resources, and I think it's fair for those users to assume that each thread comes at the cost of some memory (it must!). Let the number of locks organically increase (and decrease) in lock-step with the number of threads. That per-thread memory cost is an implementation detail, and storing one or more ReentrantLocks isn't a huge addition to all the other data structures each thread is already managing (a 53-byte lock is peanuts in comparison to the large image byte arrays).

In fact, right now the array of 1024 ReentrantLocks in MemoryLockProvider is initialized in the constructor, which immediately uses 53 kB of memory even if those locks are never all used. If we assume that a thread typically acquires one lock (e.g. on a metatile), and that your average user runs fewer than 1000 threads, then the current implementation is actually more of a memory burden for the average user than a dynamic collection of ReentrantLocks that are lazily initialized with each thread (and die with the thread) would be. The only users that would see a larger memory footprint from my proposal are those running over a thousand concurrent threads (and that's a lot!).
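
As a rough sketch of the direction I'm suggesting (just an illustration of an unbounded, lazily populated structure, not a finished implementation; in particular it never evicts entries, which a real provider would want to do once the last holder releases a lock):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

class PerKeyLocks {
    // one lock per distinct key, created lazily on first use
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    ReentrantLock getLock(String key) {
        // distinct keys never share a lock, so striping collisions (and the
        // nested-lock deadlock from my earlier example) cannot happen
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }
}

A real version would also track how many threads hold each entry and remove it when the count drops to zero, so memory scales with the number of locks actually in use rather than with every key ever seen.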

Lastly, I'm pretty sure that any sort of artificial global lock limit will prevent you from using locks in a nested way without introducing deadlocks, and I don't see a way around that functional defect. After submitting my recent pull request to the GeoServer project, I noticed that my implementation happened to perfectly match a comment you once made regarding nested locks. I think a global limit is a complete showstopper for those kinds of improvements, which would be too bad.

aaime (Member) commented Mar 9, 2024

Yep, fair enough. The bit I was mostly worried about was indeed the thread pool size; I wasn't sure about this one. Thank you for the detailed analysis.

aaime pushed a commit that referenced this issue Mar 25, 2024
aaime closed this as completed May 13, 2024