[native] Zombie tasks might not get cleaned up #22550
Comments
@czentgr : Just checking you have facebookincubator/velox#9207 in the build. #22129 (comment) is related to the sequence of events you saw as well.
We created an internal task for this and will take a look.
@aditi-pandit I was using my memory debug build, and that has an older commit base which doesn't include this fix (the build I collected with was initially new, but then I switched back to my memory debug build). I will retry with the fix.
I can repro the issue with velox commit https://github.com/facebookincubator/velox/commits/da6a3d305bd92a2ab40d2f013ed191f261a92732
Presto commit https://github.com/prestodb/presto/commits/45387f9956eb899a93b96eaccf73b239102ee3fe
I'm using my rebased memory debug build after running the query twice:
But just to be sure, I'll retry with a clean build from the above commits.
Confirming repro on current code from presto master branch:
@czentgr I ran the following query multiple times in one of our clusters and I'm not seeing any zombies:
Maybe specific data affects zombification somehow? Have you tried running it without the limit to see if there are any zombie tasks left?
@spershin Yes, it is related to the limit clause. The limit is pushed down to the partial limit in Stage 1 - we can see that in the plan fragment:
The tablescan feeding into this is from stage 2 which has 2+ million rows.
So I think you need to make sure the tablescan has enough data to read and is still busy when the limit is being processed. And yes, removing the limit clause will not result in zombies.
We do observe zombie tasks occasionally for some queries in Meta. @czentgr
@spershin I've pushed a branch with an E2E unit test that reproduces the problem. Once you've run this, you have the data and setup to run your own coordinator and worker. I hope this helps. Before running the test, export
My branch is https://github.com/czentgr/presto/tree/cz_zombie_repro. The test will not create the data if the Hive storage already has the DATE_DIM and STORE_SALES tables. So make sure you remove the directories (i.e., delete the tables). You probably know this already. By default they are in your Presto source path:
I had the tiny tables previously from other tests and had to clean them up. The generation of the SF10 data doesn't take too long - I think it was 15-20 mins on my laptop for the first time. The data should look like this once generated:
The test run will fail with
because the zombie task is present
Note, the log dir (
Please let me know if you have questions.
We are dealing with a race condition. Facts:
I will need to look at the code more and figure out whether we need a closed_ flag in the ExchangeQueue itself, or whether the protection is simply broken at the level of ExchangeClient.
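To make the race concrete, here is a minimal self-contained sketch of the pattern. All names here (ExchangeLike, pages_, and so on) are hypothetical stand-ins, not the actual Velox classes: close() fulfills the outstanding promises exactly once, while nothing stops a next() call that runs afterwards from creating a fresh promise that nobody will ever fulfill.

```cpp
// Minimal sketch of the suspected race (hypothetical names, not the actual
// Velox code). close() fulfills the outstanding promises exactly once; a
// next() call that runs after close() still creates a promise, and nothing
// will ever fulfill it.
#include <future>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

class ExchangeLike {
 public:
  // Consumer side: returns a page if one is queued; otherwise registers a
  // promise and hands back the matching future to wait on.
  std::optional<int> next(std::future<void>& wait) {
    std::lock_guard<std::mutex> l(mutex_);
    // BUG per the analysis above: closed_ is not checked here, so a promise
    // can be created after close() has already fulfilled everything.
    if (!pages_.empty()) {
      int page = pages_.front();
      pages_.pop();
      return page;
    }
    promises_.emplace_back();
    wait = promises_.back().get_future();
    return std::nullopt;
  }

  // Teardown: mark closed and fulfill the promises that exist right now.
  // This runs at most once, so promises created later are orphaned.
  void close() {
    std::vector<std::promise<void>> pending;
    {
      std::lock_guard<std::mutex> l(mutex_);
      closed_ = true;
      pending.swap(promises_);
    }
    for (auto& p : pending) {
      p.set_value();
    }
  }

 private:
  std::mutex mutex_;
  bool closed_{false};
  std::queue<int> pages_;
  std::vector<std::promise<void>> promises_;
};
```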
Confirmed the chain of calls (note that member functions are renamed a bit here to make it easier to navigate among the dozens of close() functions).
It is not protected in any way against the exchange client being closed:
A simple fix would be adding
in ExchangeClient::nextPage right after the lock. I need to come up with a unit test for that.
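For illustration, here is how that guard would look if applied to the hypothetical sketch from the previous comment, replacing the buggy next() there (the real change belongs in ExchangeClient, right after it takes its lock):

```cpp
// Fixed version of next() from the sketch above: a closed client reports
// end-of-data immediately instead of creating a promise that close() can
// no longer fulfill.
std::optional<int> ExchangeLike::next(std::future<void>& wait) {
  std::lock_guard<std::mutex> l(mutex_);
  if (closed_) {
    return std::nullopt;  // At end; crucially, no new promise is created.
  }
  if (!pages_.empty()) {
    int page = pages_.front();
    pages_.pop();
    return page;
  }
  promises_.emplace_back();
  wait = promises_.back().get_future();
  return std::nullopt;
}
```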
Summary: Using zombie task detection, we found that Drivers can end up referenced by the lambdas waiting on the promises to be fulfilled - promises given out by the Exchange. Now, when the Exchange is closed, everything under it (ExchangeClients and ExchangeQueues) is closed too, fulfilling any outstanding promises. The issue is that ExchangeClient allows new promises to be created in the next() call after we are closed. This creates a situation where these promises are never fulfilled, because there is a protection against fulfilling the outstanding promises more than once. The root cause here is that next() does not respect the 'closed_' flag and simply proceeds with asking the underlying ExchangeQueue for data, which in turn creates the promise. The fix is to check the 'closed_' flag and return straight away. The fix eliminated the zombie tasks in the E2E test I was using to reproduce the issue. GH issue for this: prestodb/presto#22550

Reviewed By: Yuhta

Differential Revision: D56712493

fbshipit-source-id: 8808f854872b68c5c29bdd67daceb656f92da8f0
This issue should be fixed now.
Describe the problem you faced
I'm running a query where one stage is cancelled and as a result one of three tasks ends up in the aborted state. However, the task is not cleaned up after the query has run successfully (and all results were returned). Instead, the task hangs around. I can run the same query multiple times and the same task is zombified each time.
Prestissimo keeps running, and when it checks for cleanup of old tasks it issues this (captured after running the same query 3 times):
It appears the task still has more than one reference.
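For illustration, here is a simplified (hypothetical, not Prestissimo's actual task-management code) example of how a stored callback can hold such an extra reference: any continuation that captures a shared_ptr to the task keeps the task alive until the continuation itself is destroyed, so an unfulfilled promise whose continuation captures the task keeps its use count above one indefinitely.

```cpp
#include <functional>
#include <iostream>
#include <memory>

struct Task {
  ~Task() { std::cout << "task destroyed\n"; }
};

int main() {
  auto task = std::make_shared<Task>();

  // Stand-in for a continuation waiting on an exchange promise: while it
  // sits somewhere unfulfilled, its capture keeps the task alive.
  std::function<void()> continuation = [task] { /* resume the driver */ };

  std::cout << "use_count = " << task.use_count() << "\n";  // prints 2
  task.reset();            // The Task still lives in the lambda's capture.
  continuation = nullptr;  // Only now does ~Task run.
  return 0;
}
```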
For example, the following query causes the issue (this is a modified subquery from TPCDS Q1) - the dataset is the 1k TPCDS dataset.
The aborted tasks are in stages 1 and 2 (out of 4 stages) and get cancelled. These stages contain a partial aggregation for the limit clause, and I suppose it cancels the other piece of the stage, which is a tablescan that feeds into the partial aggregation. It appears that stages 1 and 2 are cancelled entirely as soon as enough results are present.
Environment Description
zombie-tasks.tar.gz