Optimization Needed for File Listing Operations on gcsi-fuse-csi Mounted Volumes in Training Jobs #200

Open
bhack opened this issue Mar 19, 2024 · 2 comments


@bhack

bhack commented Mar 19, 2024

We've encountered a significant performance bottleneck in our training jobs when using file listing calls such as Path.rglob to enumerate trainable assets stored on gcsi-fuse-csi mounted volumes. The issue is already evident with datasets of typical size, leading to considerable cold start delays before training can begin.

This latency not only slows the initial start-up of our training jobs but also poses a substantial challenge on GKE spot instances. Each time a job is preempted and restarts from the last saved checkpoint, it pays the cold start penalty again because the data loaders must be re-prepared.

This recurring overhead directly impacts cost-efficiency and resource utilization, particularly in a dynamic scaling environment where jobs are frequently interrupted and resumed. Addressing this file listing performance issue could significantly reduce start-up times and improve the overall efficiency of training jobs on GKE spot instances.

@songjiaxun
Collaborator

I believe we will need the following new features or improvements to solve this issue:

  1. Pre-fetch or cache the object metadata so that listing is fast.
  2. Persist this metadata across pod or node lifecycles.
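
Until something like this exists in the driver, the same two ideas can be approximated on the application side. The following is only a minimal sketch and not the driver feature described above: the function name `list_files_cached`, the JSON manifest, and the `.jpg` suffix filter are all hypothetical choices for illustration. It lists the tree once, saves the result to a manifest file, and reuses that file on later starts (for example after a spot preemption), assuming the manifest is stored somewhere that survives the restart.

```python
import json
import os
from pathlib import Path


def list_files_cached(root, manifest, suffix=".jpg"):
    """Return relative file paths under root, reusing a saved manifest if present.

    All names here are hypothetical; this is an application-level workaround,
    not part of the CSI driver.
    """
    manifest = Path(manifest)
    if manifest.exists():
        # Fast path: a single file read replaces the recursive listing.
        return json.loads(manifest.read_text())

    # Slow path: walk the mounted volume once.
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                full = os.path.join(dirpath, name)
                files.append(os.path.relpath(full, root))
    files.sort()

    # Write to a temporary file and rename it, so that a job preempted
    # mid-write does not leave a truncated manifest behind.
    tmp = Path(str(manifest) + ".tmp")
    tmp.write_text(json.dumps(files))
    os.replace(tmp, manifest)
    return files
```

The obvious trade-off is staleness: the manifest is never invalidated here, so it has to be deleted or regenerated whenever the dataset changes.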

We are working on our roadmap, and will share more information.

FYI @sethiay

@bhack
Author

bhack commented Mar 20, 2024

This seems like a good plan, but I also think that Path.rglob carries a specific extra overhead compared with other types of listing.
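
To make that comparison concrete, here is a minimal sketch of how the overhead could be measured. It is not a benchmark result: the helper names are hypothetical, and the actual numbers will depend on the mount options and the bucket layout. One known difference is that `Path.is_file()` performs a `stat` call per entry, while `os.walk` and `os.scandir` can often answer from the directory entry itself, which may matter on a FUSE mount where each `stat` can be expensive.

```python
import os
import time
from pathlib import Path


def list_with_rglob(root):
    # Builds a Path object per entry; is_file() stats each one.
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())


def list_with_walk(root):
    # os.walk is built on os.scandir and classifies entries as it goes.
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    )


def list_with_scandir(root):
    # Explicit stack-based traversal using DirEntry type information.
    files, stack = [], [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file():
                    files.append(entry.path)
    return sorted(files)


def time_listing(fn, root):
    """Return (elapsed_seconds, number_of_files) for one listing function."""
    start = time.perf_counter()
    result = fn(root)
    return time.perf_counter() - start, len(result)
```

Calling `time_listing` with each of the three functions against the same mounted directory, on a cold start, should show whether the gap is specific to `Path.rglob` or common to every recursive listing.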
