Shuffle pipeline takes extremely long time to finish #7247
Hi @RaananHadar! As a side-effect of some of the performance trade-offs made in 2.0, the shuffle pipeline implemented with empty files has become comparatively less performant than copying files from input to output. Could you try shuffling with `empty_files: false`?
Agent Dale Georg linked Freshdesk ticket 401 for this issue.
I've changed `empty_files` to `false` and attempted to copy the files. This does improve performance for a small number of files. However, my practical pipeline handles gigabytes per datum across ~10k files, which means the user code now becomes the bottleneck. So now I have to optimize my user code to copy in parallel... which is not terrible, but I honestly think the `empty_files` pattern would have many advantages had it been optimized...
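The "parallel copy" workaround mentioned above could look roughly like the sketch below. This is not the reporter's actual code; the `parallel_copy` helper and the worker count are illustrative assumptions. Since file copies are I/O-bound, a thread pool speeds them up despite Python's GIL.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(src_dir, dst_dir, workers=16):
    """Copy every regular file under src_dir into dst_dir using a thread pool."""
    os.makedirs(dst_dir, exist_ok=True)
    names = [
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(
                shutil.copy,
                os.path.join(src_dir, n),
                os.path.join(dst_dir, n),
            )
            for n in names
        ]
        for f in futures:
            f.result()  # re-raise any copy error from a worker thread
    return len(names)
```

In a Pachyderm pipeline container, `src_dir` would be the input mount under `/pfs/<repo>` and `dst_dir` would be `/pfs/out`.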
The peculiar thing is that this also happens on 1.13.4. It takes about 8 minutes to run the above symlink pipeline on 1.13.4 with only 50 generated files, so I think something else is at work here.
Sorry for the delay, @RaananHadar. Compaction and symlinking were significantly improved in 2.1.6 and 2.1.7. Can you upgrade to 2.1.7 and let us know how performance looks for your specific use case?
Thank you @BOsterbuhr, I will retest and report back.
I did test this with 2.1.7 on Azure, and I simplified the setup somewhat to test more general shuffle pipelines: in my simplified pipeline I take x inputs and map them to x outputs (I can share the code if needed). Here are my findings. Bottom line, I think solving this via the …
Here is the updated mock to reproduce this: data generation code:
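The data generation code was not captured in this thread. As a rough sketch of what it likely does (the file names, the count of 50, and the `pachctl put file` upload step are assumptions based on the numbers mentioned above):

```python
import os

def generate_files(out_dir, count=50):
    """Create `count` empty placeholder files, file_000 ... file_NNN.

    Upload them to the input repo afterwards with something like:
        pachctl put file input@master:/ -r -f <out_dir>
    """
    os.makedirs(out_dir, exist_ok=True)
    for i in range(count):
        # touch an empty file; with empty_files the contents never matter
        open(os.path.join(out_dir, "file_%03d" % i), "w").close()
    return count
```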
pipeline code's main.py:
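The `main.py` itself was also lost in extraction. Below is a minimal sketch of the symlink-based shuffle pattern this thread is about; the `/pfs/input` repo name and the one-to-one name mapping are assumptions, not the reporter's actual code:

```python
import os

INPUT_DIR = "/pfs/input"  # assumed input repo mount
OUTPUT_DIR = "/pfs/out"

def shuffle(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Map each input file to an output path without reading its bytes.

    With empty_files: true the inputs are zero-byte placeholders, so a
    symlink is enough for Pachyderm to wire each output to its source.
    """
    n = 0
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if not os.path.isfile(src):
            continue
        # x -> x mapping here; a real shuffle would rename or regroup
        os.symlink(src, os.path.join(output_dir, name))
        n += 1
    return n
```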
pipeline spec:
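The pipeline spec did not survive the scrape either. A plausible shape, given the discussion of a `/` glob and the `empty_files` flag (the pipeline name, repo name, and image are placeholders):

```json
{
  "pipeline": { "name": "shuffle" },
  "input": {
    "pfs": {
      "repo": "input",
      "glob": "/",
      "empty_files": true
    }
  },
  "transform": {
    "image": "python:3.9",
    "cmd": ["python3", "/app/main.py"]
  }
}
```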
What happened?:
I've created a shuffle pipeline that takes a "/" glob with files:
and creates an output with combinations:
For each combination...
The user code finishes in less than a second, but the storage container takes extremely long to finish. For only 20 files, the job takes ~40 seconds; for 50 it may take minutes; and for the real number of files the job must handle (tens of thousands), it's impractical to run in Pachyderm at the moment.
What you expected to happen?:
It's a shuffle pipeline with empty files; it should complete extremely quickly.
How to reproduce it (as minimally and precisely as possible)?:
I've created a basic mock to reproduce this pipeline:
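The mock itself was lost in extraction. Based on the description above (a `/` glob over input files, an output made of combinations), the core of it was presumably something like the following sketch; the pairwise combinations and the `<a>_<b>` output naming are guesses for illustration:

```python
import itertools
import os

def build_combinations(input_dir, output_dir):
    """For every pair of input files, emit one output directory holding
    symlinks to both members of the pair (zero-byte under empty_files)."""
    files = sorted(
        f for f in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, f))
    )
    pairs = list(itertools.combinations(files, 2))
    for a, b in pairs:
        combo = os.path.join(output_dir, "%s_%s" % (a, b))
        os.makedirs(combo, exist_ok=True)
        os.symlink(os.path.join(input_dir, a), os.path.join(combo, a))
        os.symlink(os.path.join(input_dir, b), os.path.join(combo, b))
    return len(pairs)
```

Note that the work is pure metadata (directory creation plus symlinks), which is why the user code completes in under a second regardless of file size.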
Anything else we need to know?:
Environment?:
Kubernetes version (`kubectl version`): AKS 1.20.9
Pachyderm version (`pachctl version`): tested and reproduced with the latest versions of both Pachyderm 1.13.4 and Pachyderm 2.0.5

Looking at the storage container logs, the user code finishes in less than a second. The storage container takes a while to run and appears to generate a lot of calls to pachd. As I said, this finishes successfully for a small number of files, but the code does not seem to scale to large numbers.