New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check 'closed_' in ExchangeClient::next() #9658
base: main
Are you sure you want to change the base?
Conversation
Summary: Using Zombie Tasks we detected that Drviers can end up referenced by the lambdas waiting on the promises to be fulfilled. Promises given by the Exchange. Now, when Exchange is being closed, than everything under it (ExchangeClients and ExchangeQueues) are beling closed too, fulfilling any outstanding promises. The issue is that ExchangeClient allows to new promises being created in the next() call after we are closed(). This creates a situation where these promises are never fulfilled, because there is a proteciton to not call the fulfilling any outstanding promises more than once. The toot cause here is that next() does not respect 'closed_' flag and simply proceeds with asking the underlying ExchnageQueue for data, which in turn creates the promise. The fix is to check the 'closed_' flag and return straight away. The fix fixed the Zombie Tasks in the E2E test I was using to reproduce the issue. GH issue for this: prestodb/presto#22550 Differential Revision: D56712493
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spershin thanks for the fix % left a question
@@ -117,6 +117,11 @@ ExchangeClient::next(uint32_t maxBytes, bool* atEnd, ContinueFuture* future) { | |||
std::vector<std::unique_ptr<SerializedPage>> pages; | |||
{ | |||
std::lock_guard<std::mutex> l(queue_->mutex()); | |||
if (closed_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is good to have a check here but I think this can't totally prevent this from happening. We probably need closed_ flag inside the queue as queue close is not called with queue lock protection as exchange client close does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiaoxmeng
ExchangeQueue by itself it not protected, instead we rely on protection in the ExchangeClient.
I believe this problem is fully solved.
We will observe the zombie tasks and if anything occurs - we will notice and address it.
Summary:
Using Zombie Tasks we detected that Drviers can end up referenced by the lambdas waiting on the promises to be fulfilled.
Promises given by the Exchange.
Now, when Exchange is being closed, than everything under it (ExchangeClients and ExchangeQueues) are beling closed too, fulfilling any outstanding promises.
The issue is that ExchangeClient allows to new promises being created in the next() call after we are closed().
This creates a situation where these promises are never fulfilled, because there is a proteciton to not call the fulfilling any outstanding promises more than once.
The root cause here is that next() does not respect 'closed_' flag and simply proceeds with asking the underlying ExchnageQueue for data, which in turn creates the promise.
The fix is to check the 'closed_' flag and return straight away.
The fix fixed the Zombie Tasks in the E2E test I was using to reproduce the issue.
GH issue for this: prestodb/presto#22550
Differential Revision: D56712493