Refresh leads to starvation of servers #2341

Open
hosea opened this issue Oct 24, 2023 · 0 comments

Comments

hosea commented Oct 24, 2023

My Spring Boot based web application relies heavily on "refresh": nearly every configuration value can be refreshed.
This now leads to starvation of some servers after a refresh.

What I can observe in my heap dump:
Nearly all threads are waiting for a read lock that is requested in
org.springframework.cloud.context.scope.GenericScope.LockedScopedProxyFactoryBean#invoke:
"Lock lock = readWriteLock.readLock();"
These threads are all waiting for the same lock, because they all want to call a method on the same refresh-scoped bean.

And there is one thread that is executing the refresh. This thread requests the same lock as a write lock in
org.springframework.cloud.context.scope.GenericScope#destroy:
"Lock lock = this.locks.get(wrapper.getName()).writeLock();"
and it, too, is waiting for the lock.
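
To make the two call sites easier to follow, here is a minimal, self-contained model of the locking scheme, assuming one ReadWriteLock per bean name. This is just an illustration with made-up class and method names, not the actual Spring Cloud source:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Illustrative model only: one ReadWriteLock per bean *name*.
class PerNameLocking {

    private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String beanName) {
        return locks.computeIfAbsent(beanName, name -> new ReentrantReadWriteLock());
    }

    // Corresponds to the LockedScopedProxyFactoryBean#invoke side: every proxied
    // call holds the read lock for the whole duration of the method call.
    <T> T invoke(String beanName, Supplier<T> methodCall) {
        Lock read = lockFor(beanName).readLock();
        read.lock();
        try {
            return methodCall.get();
        } finally {
            read.unlock();
        }
    }

    // Corresponds to the GenericScope#destroy side: the write lock for the same
    // name is only granted once all in-flight read locks have been released.
    void destroy(String beanName, Runnable destroyCallback) {
        Lock write = lockFor(beanName).writeLock();
        write.lock();
        try {
            destroyCallback.run();
        } finally {
            write.unlock();
        }
    }
}
```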

Some details: my web application is heavily used; there are more than 10 servers in production, and the bean involved is the service that retrieves the advertising for the main page. This takes some time: it determines the ads to show and loads the ad content and the pictures from backends. The result can then be rendered on the main page in one go (=> minimal flickering). Because it takes some time, this execution is done asynchronously: the main page is shown first and the ads are added later in one go.
Determining the ads is configurable, which is the reason why the bean (service) is refresh-scoped.

In short: a heavily used service running on many servers, with long-running service calls, modeled as a refresh-scoped bean.

What happens during a refresh:
All servers are informed at nearly the same time. Every server tries to destroy the bean. This means:

  • Request the write lock for the corresponding bean name.
  • This takes some time, because several threads are still using the bean and hold a read lock for that bean name, and all of these read locks must be released ("unlocked") before the write lock can be granted.
  • During this time, no newly arriving thread can get a read lock => state "WAITING", because a write lock has been requested and is waiting to become available (see the sketch after this list).

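Here is a standalone sketch of that timeline with a plain ReentrantReadWriteLock (made-up names and timings; it only models the locking pattern, not the real request handling). It typically prints WAITING for both the refresh thread and the new request:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Standalone sketch of the refresh timeline described above.
public class RefreshStarvationDemo {

    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Models a proxied call to the refresh-scoped bean: the read lock is held
    // for the whole (slow) call, as on the LockedScopedProxyFactoryBean#invoke side.
    static Thread request(String name, long millis) {
        Thread t = new Thread(() -> {
            lock.readLock().lock();
            try {
                TimeUnit.MILLISECONDS.sleep(millis);   // slow backend call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        }, name);
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        request("in-flight-request", 5_000);           // holds the read lock for 5 s
        TimeUnit.MILLISECONDS.sleep(100);

        // Refresh arrives: the destroy side asks for the write lock and queues up.
        Thread destroyer = new Thread(() -> {
            lock.writeLock().lock();                   // as on the GenericScope#destroy side
            lock.writeLock().unlock();
        }, "refresh-destroy");
        destroyer.start();
        TimeUnit.MILLISECONDS.sleep(100);

        // A brand-new request arrives: it cannot get a read lock while the
        // write lock is being waited for.
        Thread newRequest = request("new-request", 100);
        TimeUnit.MILLISECONDS.sleep(100);

        System.out.println("refresh-destroy: " + destroyer.getState());   // WAITING
        System.out.println("new-request:     " + newRequest.getState());  // WAITING
    }
}
```
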
What you can observe is a kind of "drying out": an increasing number of waiting threads, followed by high pressure on the backends and reduced performance for a while.

With only one server this behavior may be acceptable. But with many servers there is a kind of "wave" that rolls over the servers and potentially kills the weakest one(s).
Servers are never perfectly balanced. So when the first server has finished destroying its bean, it puts high pressure on the backends. That delays the other servers, which are still "drying out", even further. Then the second server finishes => the situation gets a little worse for the remaining servers ... and so on. With some bad luck, the server(s) with the highest load run(s) out of available threads => starvation.

I think the root problem is that destroying a bean also blocks access to a new instance for the same bean name: the lock is keyed by the bean name, not by the bean instance.
But as far as I understand the code, this locking is only needed to ensure that no thread is still using the old bean instance before it is destroyed.
The lock seems to be too coarse.
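
To illustrate what I mean, here is a rough sketch of per-instance locking (my own idea, not existing Spring Cloud code; all names are made up): the refresh publishes the fresh instance first and then waits only for the callers that are still inside the old instance before destroying it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;
import java.util.function.Function;

// Sketch only: each bean instance carries its own read/write lock, so destroying
// the old instance never blocks callers of the new one.
class InstanceScopedHolder<T> {

    private static final class Guarded<B> {
        final B bean;
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        Guarded(B bean) { this.bean = bean; }
    }

    private final AtomicReference<Guarded<T>> current;

    InstanceScopedHolder(T initial) {
        this.current = new AtomicReference<>(new Guarded<>(initial));
    }

    // Callers read-lock only the instance they actually use.
    <R> R invoke(Function<T, R> call) {
        Guarded<T> g = current.get();
        g.lock.readLock().lock();
        try {
            return call.apply(g.bean);
        } finally {
            g.lock.readLock().unlock();
        }
    }

    // Refresh: publish the fresh instance first, then wait only for the callers
    // that are still inside the *old* instance before destroying it.
    void refresh(T fresh, Consumer<T> destroyer) {
        Guarded<T> old = current.getAndSet(new Guarded<>(fresh));
        old.lock.writeLock().lock();   // new requests already go to the fresh instance
        try {
            destroyer.accept(old.bean);
        } finally {
            old.lock.writeLock().unlock();
        }
    }
}
```

With something like this, new requests would never have to wait for the destruction of the old instance.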
