Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check 'closed_' in ExchangeClient::next() #9658

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

spershin
Copy link
Contributor

@spershin spershin commented Apr 29, 2024

Summary:
Using Zombie Tasks we detected that Drviers can end up referenced by the lambdas waiting on the promises to be fulfilled.
Promises given by the Exchange.
Now, when Exchange is being closed, than everything under it (ExchangeClients and ExchangeQueues) are beling closed too, fulfilling any outstanding promises.
The issue is that ExchangeClient allows to new promises being created in the next() call after we are closed().
This creates a situation where these promises are never fulfilled, because there is a proteciton to not call the fulfilling any outstanding promises more than once.

The root cause here is that next() does not respect 'closed_' flag and simply proceeds with asking the underlying ExchnageQueue for data, which in turn creates the promise.

The fix is to check the 'closed_' flag and return straight away.
The fix fixed the Zombie Tasks in the E2E test I was using to reproduce the issue.
GH issue for this: prestodb/presto#22550

Differential Revision: D56712493

Summary:
Using Zombie Tasks we detected that Drviers can end up referenced by the lambdas waiting on the promises to be fulfilled.
Promises given by the Exchange.
Now, when Exchange is being closed, than everything under it (ExchangeClients and ExchangeQueues) are beling closed too, fulfilling any outstanding promises.
The issue is that ExchangeClient allows to new promises being created in the next() call after we are closed().
This creates a situation where these promises are never fulfilled, because there is a proteciton to not call the fulfilling any outstanding promises more than once.

The toot cause here is that next() does not respect 'closed_' flag and simply proceeds with asking the underlying ExchnageQueue for data, which in turn creates the promise.

The fix is to check the 'closed_' flag and return straight away.
The fix fixed the Zombie Tasks in the E2E test I was using to reproduce the issue.
GH issue for this: prestodb/presto#22550

Differential Revision: D56712493
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 29, 2024
Copy link

netlify bot commented Apr 29, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit d7e1f31
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/662fda58c0089900080fa849

@spershin spershin requested a review from xiaoxmeng April 29, 2024 20:01
@majetideepak
Copy link
Collaborator

This landed on main 8567d4d
but did not close this. @spershin can you check? Thanks.

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spershin thanks for the fix % left a question

@@ -117,6 +117,11 @@ ExchangeClient::next(uint32_t maxBytes, bool* atEnd, ContinueFuture* future) {
std::vector<std::unique_ptr<SerializedPage>> pages;
{
std::lock_guard<std::mutex> l(queue_->mutex());
if (closed_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good to have a check here but I think this can't totally prevent this from happening. We probably need closed_ flag inside the queue as queue close is not called with queue lock protection as exchange client close does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiaoxmeng
ExchangeQueue by itself it not protected, instead we rely on protection in the ExchangeClient.
I believe this problem is fully solved.

We will observe the zombie tasks and if anything occurs - we will notice and address it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants