Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[controller] Fix multiple AdminExecutionTasks working on the same store at the same time. #918

Merged
merged 5 commits into from May 8, 2024

Conversation

haoxu07
Copy link
Contributor

@haoxu07 haoxu07 commented Mar 27, 2024

Summary, imperative, start upper case, don't end with a period

Current state: when we create AdminExecutionTask , we did not check if there is AdminExecutionTask for one store on the fly, so if one AdminExecutionTask is waiting for a lock (e.g. updating store operation waiting for the store level write lock and that lock is currently unavailable), it is possible that AdminConsumptionTask will create multiple AdminExecutionTask s for a single store until they occupy all threads from ExecutorService 's thread pool, they are all waiting for the same store-level lock.

Besides, executorService.invokeAll(tasks, processingCycleTimeoutInMs, TimeUnit.MILLISECONDS)
will cancel every invoked tasks, even the task is emitted by not getting thread from pool to execute. ExecutorService will try to cancel the AdminExecutionTask waiting for a lock after timeout, but that will not terminate that thread (as the acquiring locking operation by AutoCloseableLock is not interruptible). Then that AdminExecutionTask will keep occupying one thread from ExecutorService 's thread pool until the lock is released.

This PR has the following changes to address the issue:

  1. Adding check to see if there is AdminExecutionTask is running for a store.
  2. Integration test to simulate the lock blocking one thread from pool and recover after acquiring the lock.

How was this PR tested?

CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@haoxu07 haoxu07 changed the title [controller] Fix multiple AdminExecutionTasks could work on the same store at the same time. [controller] Fix multiple AdminExecutionTasks working on the same store at the same time. Mar 29, 2024
@haoxu07 haoxu07 marked this pull request as ready for review March 29, 2024 18:28
@gaojieliu
Copy link
Contributor

I have a few questions:

  1. is it expected to run some admin logic, which is not interruptible? I think we should make all the logic interruptible, otherwise, it can block the graceful shutdown.
  2. Is there any particular type of admin message, which would stuck or take longer time than others?

We face a similar situation when implementing async gauge execution, and we adopted the following strategy, which seems to work well:
tehuti-io/tehuti#31

cc @xunyin8

@haoxu07
Copy link
Contributor Author

haoxu07 commented Apr 4, 2024

I have a few questions:

  1. is it expected to run some admin logic, which is not interruptible? I think we should make all the logic interruptible, otherwise, it can block the graceful shutdown.
  2. Is there any particular type of admin message, which would stuck or take longer time than others?

We face a similar situation when implementing async gauge execution, and we adopted the following strategy, which seems to work well: tehuti-io/tehuti#31

cc @xunyin8

Regarding your questions:
One case for logic non-interruptible is: our current store level lock acquiring inside VeniceHelixAdmin through AutoCloseableLock is not interruptible by using lock.lock():

. When a thread is waiting for the lock, it is not interruptible. We could improve this by using lock.lockInterruptibly(), but I found it will some test failure, which I am trying to figure out.

But thanks for suggested solutions, let me figure out if it works.

Copy link
Contributor

@lluwm lluwm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just some comments on the integration part.

@lluwm
Copy link
Contributor

lluwm commented May 8, 2024

@haoxu07, LGTM and thanks for the fix.

@haoxu07 haoxu07 merged commit 4b5e4ff into linkedin:main May 8, 2024
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants