Refresh leads to starvation of servers #2341

Open
hosea opened this issue Oct 24, 2023 · 0 comments

Comments

hosea commented Oct 24, 2023

My Spring Boot based web application relies heavily on "refresh": nearly every configuration value can be refreshed.
This now leads to starvation of some servers after a refresh.

What I can observe in my heap dump:
Nearly all threads are waiting for a read lock that is requested in
org.springframework.cloud.context.scope.GenericScope.LockedScopedProxyFactoryBean#invoke:
"Lock lock = readWriteLock.readLock();"
These threads are all waiting for the same lock, because they all want to call a method on the same refresh-scoped bean.

And there is one thread that is executing the refresh. This thread requests the same lock as a write lock in
org.springframework.cloud.context.scope.GenericScope#destroy:
"Lock lock = this.locks.get(wrapper.getName()).writeLock();"
and it, too, is waiting for the lock.
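
To make the two call sites easier to follow, here is a minimal, self-contained model of the locking scheme, assuming one ReadWriteLock per bean name. This is just an illustration with made-up class and method names, not the actual Spring Cloud source:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Illustrative model only: one ReadWriteLock per bean *name*.
class PerNameLocking {

    private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String beanName) {
        return locks.computeIfAbsent(beanName, name -> new ReentrantReadWriteLock());
    }

    // Corresponds to the LockedScopedProxyFactoryBean#invoke side: every proxied
    // call holds the read lock for the whole duration of the method call.
    <T> T invoke(String beanName, Supplier<T> methodCall) {
        Lock read = lockFor(beanName).readLock();
        read.lock();
        try {
            return methodCall.get();
        } finally {
            read.unlock();
        }
    }

    // Corresponds to the GenericScope#destroy side: the write lock for the same
    // name is only granted once all in-flight read locks have been released.
    void destroy(String beanName, Runnable destroyCallback) {
        Lock write = lockFor(beanName).writeLock();
        write.lock();
        try {
            destroyCallback.run();
        } finally {
            write.unlock();
        }
    }
}
```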

Some details: my web application is heavily used; there are more than 10 servers in production, and the bean involved is the service that retrieves the advertising for the main page. This takes some time: it determines the ads to show and loads the ad content and the pictures from backends. The result can then be rendered on the main page in one go (=> minimal flickering). Because it takes some time, this execution is done asynchronously: the main page is shown first and the ads are added later in one go.
Determining the ads is configurable, which is the reason why the bean (service) is refresh-scoped.

In short: a heavily used service running on many servers, with long-running service calls, modeled as a refresh-scoped bean.

What happens during a refresh:
All servers are informed at nearly the same time. Every server tries to destroy the bean. This means:

  • Request the write lock for the corresponding bean name.
  • This takes some time, because several threads are still using the bean and hold a read lock for that bean name, and all of these read locks must be released ("unlocked") before the write lock can be granted.
  • During this time, no newly arriving thread can get a read lock => state "WAITING", because a write lock has been requested and is waiting to become available (see the sketch after this list).

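Here is a standalone sketch of that timeline with a plain ReentrantReadWriteLock (made-up names and timings; it only models the locking pattern, not the real request handling). It typically prints WAITING for both the refresh thread and the new request:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Standalone sketch of the refresh timeline described above.
public class RefreshStarvationDemo {

    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Models a proxied call to the refresh-scoped bean: the read lock is held
    // for the whole (slow) call, as on the LockedScopedProxyFactoryBean#invoke side.
    static Thread request(String name, long millis) {
        Thread t = new Thread(() -> {
            lock.readLock().lock();
            try {
                TimeUnit.MILLISECONDS.sleep(millis);   // slow backend call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        }, name);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        request("in-flight-request", 5_000);           // holds the read lock for 5 s
        TimeUnit.MILLISECONDS.sleep(100);

        // Refresh arrives: the destroy side asks for the write lock and queues up.
        Thread destroyer = new Thread(() -> {
            lock.writeLock().lock();                   // as on the GenericScope#destroy side
            lock.writeLock().unlock();
        }, "refresh-destroy");
        destroyer.start();
        TimeUnit.MILLISECONDS.sleep(100);

        // A brand-new request arrives: it cannot get a read lock while the
        // write lock is being waited for.
        Thread newRequest = request("new-request", 100);
        TimeUnit.MILLISECONDS.sleep(100);

        System.out.println("refresh-destroy: " + destroyer.getState());   // WAITING
        System.out.println("new-request:     " + newRequest.getState());  // WAITING
    }
}
```
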
What you can observe is a kind of "drying out": an increasing number of waiting threads, followed by high pressure on the backends and reduced performance for a while.

With only one server this behavior may be acceptable. But with many servers there is a kind of "wave" that rolls over the servers and potentially kills the weakest one(s).
Servers are never perfectly balanced. So when the first server has finished destroying its bean, it puts high pressure on the backends. That delays the other servers, which are still "drying out", even further. Then the second server finishes => the situation gets a little worse for the remaining servers ... and so on. With some bad luck, the server(s) with the highest load run(s) out of available threads => starvation.

I think the root problem is that destroying a bean also blocks access to a new instance for the same bean name: the lock is keyed by the bean name, not by the bean instance.
But as far as I understand the code, this locking is only needed to ensure that no thread is still using the old bean instance before it is destroyed.
The lock seems to be too coarse.
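
To illustrate what I mean, here is a rough sketch of per-instance locking (my own idea, not existing Spring Cloud code; all names are made up): the refresh publishes the fresh instance first and then waits only for the callers that are still inside the old instance before destroying it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;
import java.util.function.Function;

// Sketch only: each bean instance carries its own read/write lock, so destroying
// the old instance never blocks callers of the new one.
class InstanceScopedHolder<T> {

    private static final class Guarded<B> {
        final B bean;
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        Guarded(B bean) { this.bean = bean; }
    }

    private final AtomicReference<Guarded<T>> current;

    InstanceScopedHolder(T initial) {
        this.current = new AtomicReference<>(new Guarded<>(initial));
    }

    // Callers read-lock only the instance they actually use.
    <R> R invoke(Function<T, R> call) {
        Guarded<T> g = current.get();
        g.lock.readLock().lock();
        try {
            return call.apply(g.bean);
        } finally {
            g.lock.readLock().unlock();
        }
    }

    // Refresh: publish the fresh instance first, then wait only for the callers
    // that are still inside the *old* instance before destroying it.
    void refresh(T fresh, Consumer<T> destroyer) {
        Guarded<T> old = current.getAndSet(new Guarded<>(fresh));
        old.lock.writeLock().lock();   // new requests already go to the fresh instance
        try {
            destroyer.accept(old.bean);
        } finally {
            old.lock.writeLock().unlock();
        }
    }
}
```

With something like this, new requests would never have to wait for the destruction of the old instance.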
