
ephemeral storage for very large tar ~100GB #57

Open
ashish01987 opened this issue Aug 24, 2023 · 30 comments
Labels
enhancement New feature or request

Comments

@ashish01987

ashish01987 commented Aug 24, 2023

I have a local folder (backup) of ~100GB of files. If I directly tar the folder onto the bucket, e.g. tar -cf /tmp/bucketmount/backup.tar /backup/, will there be any issues with the CSI driver? I see that the gcsfuse CSI driver depends on a temp directory (emptyDir: {}) for staging files before they are uploaded to the bucket.

@ashish01987 ashish01987 changed the title from "I have a local folder (backup) of ~100GB files, if i directly tar the folder onto bucket e.g tar -cf /tmp/bucketmount/backup.tar /backup/" to "ephemeral storage for very large tar ~100GB" Aug 24, 2023
@songjiaxun
Collaborator

I think the CSI driver should work in this use case. @ashish01987 do you see any errors?

As you mentioned, gcsfuse uses a temp directory for staging files. As a result, please consider increasing the sidecar container ephemeral-storage-limit so that gcsfuse has enough space to stage the files.

See the GKE documentation for more information.
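
For example, the limit can be raised with a Pod annotation along these lines (a rough sketch: the annotation names come from the GKE docs, but the Pod name and the 50Gi value are placeholders and should be sized for your largest staged file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backup-pod                              # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/ephemeral-storage-limit: 50Gi   # illustrative value; size it for the largest staged tar
spec:
  # ... containers and the gcsfuse CSI ephemeral volume as usual
```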

@ashish01987
Author

ashish01987 commented Aug 25, 2023

..

@ashish01987
Author

Thanks for the quick response. I created a 30GB tar file "backup.tar" directly on the GCS-mounted bucket (mounted by the CSI sidecar) and did not find any issues with it. Just one question:
While "backup.tar" (30GB) is being created on the mounted bucket, will the CSI sidecar wait for the complete "backup.tar" (30GB) file to be created on ephemeral storage (emptyDir: {}) and then copy it to the actual bucket on Cloud Storage?

If yes, I am a bit concerned about the case where the "backup.tar" size keeps increasing (maybe to 100GB or more due to regular backups) and sufficient node ephemeral storage is not available. In that case, one may have to increase the node's ephemeral storage manually, which might cause downtime for the cluster (probably?).

I see that the CSI sidecar uses the "gke-gcsfuse-tmp" mount point, backed by emptyDir: {}, for staging files before uploading:

  - emptyDir: {}
    name: gke-gcsfuse-tmp

It would be great if allocating storage from a regular persistent disk (or an NFS share) were supported here for gke-gcsfuse-tmp. That way we could allocate any amount of storage without changing the node's ephemeral storage (and avoid cluster downtime).

I tried something like this:

volumes:
  - name: gke-gcsfuse-tmp
    persistentVolumeClaim:
      claimName: common-backup

where "common-backup" allocates storage from a persistent disk or an NFS share based on the storage class.

However, it did not work: the deployment did not start and was not able to find the CSI sidecar. Probably some validation is in place to check that "gke-gcsfuse-tmp" uses emptyDir: {} only?

Maybe supporting both emptyDir: {} and allocating storage from a PVC as above for "gke-gcsfuse-tmp" would be beneficial (if the implementation is feasible).

@songjiaxun Let me know your thoughts on this

@ashish01987
Author

@songjiaxun any thoughts on this?

@songjiaxun
Collaborator

Hi @ashish01987 , thanks for testing out the staging file system.

To answer your question: yes, in the current design the volume gke-gcsfuse-tmp has to be an emptyDir; see the validation logic code.

The GCS FUSE team is working on a write-through feature, which means the staging volume may not be needed in a future release. @sethiay and @Tulsishah, could you share more information about the write-through feature? And will it support this "tar file" use case?

Meanwhile, @judemars FYI as you may need to add a new volume to the sidecar container for the read caching feature.

@sethiay
Contributor

sethiay commented Aug 30, 2023

Thanks @songjiaxun for looping us in. Currently, we are evaluating support for a write-through feature in GCSFuse, i.e. allowing users to write directly to GCS without buffering on local disk. Given that tar works now with GCSFuse, we expect it to work with the write-through feature as well.

@ashish01987
Author

What is the expected timeline for the write-through feature?

@sethiay
Contributor

sethiay commented Sep 4, 2023

@ashish01987 Currently, we don't have any timelines to share.

@ashish01987
Author

ashish01987 commented Sep 4, 2023

@songjiaxun Since we don't know the timeline for the write-through feature, as a workaround can we disable this validation logic code and support allocating storage from any PVC for gke-gcsfuse-tmp?

That is, the storage could be allocated from a persistent disk instead of the node's ephemeral storage.

That way, customers using the gcsfuse CSI driver would never face issues like "insufficient ephemeral storage".

I'm not sure, but such issues can arise in clusters where multiple pods each have their own gcsfuse CSI sidecar instance.

@songjiaxun
Collaborator

@ashish01987 , thanks for the suggestion.

As more and more customers report "insufficient ephemeral storage" issues, we are exploring the possibility of allowing users to use volumes backed by other media, rather than emptyDir, for write staging.

FYI @judemars .

@ashish01987
Author

ashish01987 commented Sep 22, 2023

For the time being, is it possible to make this validation

if v.Name == SidecarContainerVolumeName && v.VolumeSource.EmptyDir != nil {

optional? That way I could define the sidecar with gke-gcsfuse-tmp pointing to a PVC:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-test
spec:
  serviceAccountName: gcs-csi
  containers:
  - name: busybox
    image: busybox
    resources:
      limits:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
      requests:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
    command:
    - "/bin/sh"
    - "-c"
    - sleep infinite
    volumeMounts:
    - name: gcs-fuse-csi-ephemeral
      mountPath: /data
  - name: gke-gcsfuse-sidecar
    image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v0.1.4-gke.1@sha256:442969f1e565ba63ff22837ce7a530b6cbdb26330140b7f9e1dc23f53f1df335
    imagePullPolicy: IfNotPresent
    args:
    - --v=5
    resources:
      limits:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
      requests:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsGroup: 65534
      runAsNonRoot: true
      runAsUser: 65534
      seccompProfile:
        type: RuntimeDefault
    volumeMounts:
    - mountPath: /gcsfuse-tmp
      name: gke-gcsfuse-tmp
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: # unique bucket name
  - name: gke-gcsfuse-tmp
    persistentVolumeClaim:
      claimName: my-pvc-backup
```

@ashish01987
Author

I see that writing very large files, like a 70GB .tar file, will fail if that much ephemeral storage is not available on the node.

@songjiaxun
Collaborator

@ashish01987 , thanks for the suggestion and for reporting the issue. I am actively working on skipping the validation and will keep you posted.

@ashish01987
Author

Thanks for looking into this. Maybe it would be great if the "claimName: my-pvc-backup" for gke-gcsfuse-tmp could be passed as a parameter through an annotation on the pod.

@songjiaxun
Collaborator

I am working on a new feature to allow you to specify a separate volume for the write buffering. I will keep you posted.

@songjiaxun songjiaxun added the enhancement (New feature or request) label Jan 11, 2024
@bhack

bhack commented Jan 25, 2024

To avoid cross-posting: I think that in the meantime we could still better notify the user about the sidecar-specific tempdir pressure/occupancy.
See more at #21 (comment)

@bhack

bhack commented Jan 30, 2024

@songjiaxun In the meantime, is there a temporary workaround to monitor ephemeral storage occupancy with a kubectl exec command on the gcsfuse sidecar container?
I have a pod being evicted, presumably for exceeding ephemeral storage, but I cannot confirm that, and I want to investigate/monitor the gcsfuse ephemeral disk occupancy.

@songjiaxun
Collaborator

Hi @bhack, because the gcsfuse sidecar container is a distroless container, you cannot run any shell commands in it using kubectl exec.

We are rolling out the feature to support custom volumes for write buffering. The new feature should be available soon.

Meanwhile, if you are experiencing ephemeral storage limit issues, consider setting the pod annotation gke-gcsfuse/ephemeral-storage-limit: "0". It will unset any ephemeral storage limit on the sidecar container.
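
For example, just the relevant annotation (a sketch):

```yaml
metadata:
  annotations:
    gke-gcsfuse/ephemeral-storage-limit: "0"   # unsets the ephemeral storage limit on the sidecar container
```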

@bhack

bhack commented Jan 31, 2024

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

@songjiaxun
Collaborator

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

Oh sorry, I forgot the context of Autopilot. Unfortunately, no, gke-gcsfuse/ephemeral-storage-limit: "0" only works on Standard clusters.

Is your application writing large files back to the bucket?

@bhack

bhack commented Jan 31, 2024

Is your application writing large files back to the bucket?

Not so large.
It is a classical ML workload: regular checkpoints + TB logs.
Everything goes OK, but at some point, after many K-steps, the pod starts to be regularly evicted for ephemeral storage, even after restarting from the last checkpoint (e.g. I've also tested this with a restarting job and spot instances).

The main pod is quite complex, but it seems it is not writing anything other than to the gcsfuse CSI-mounted volumes.
But with the current tools it is hard to debug.

That is why we need something to monitor the sidecar's pressure/occupancy on ephemeral storage (and likely on CPU and memory).

This is both to debug when things fail and to keep a residual margin in resource planning.
The latter is also important on Autopilot, as we cannot just set the limits to 0.

Just to focus on the ephemeral storage point: in the sidecar log I see
sidecar_mounter.go:86] gcsfuse mounting with args [gcsfuse --temp-dir <volume-name-temp-dir>

Can you monitor that temp-dir occupancy in Go so that we can start to have some warnings in the logs?
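
Even something as simple as periodically walking the temp dir and logging a warning above a threshold would help; a rough sketch (not driver code; the path and threshold are placeholders):

```go
// Rough sketch of a temp-dir usage watcher; the path and threshold are placeholders,
// not actual gcsfuse or CSI driver configuration.
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// dirUsageBytes sums the apparent size of all regular files under dir.
func dirUsageBytes(dir string) (int64, error) {
	var total int64
	err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	const tempDir = "/gcsfuse-tmp/.volumes" // placeholder temp-dir path
	const warnBytes = 8 << 30               // placeholder threshold: warn above 8 GiB

	for range time.Tick(30 * time.Second) {
		used, err := dirUsageBytes(tempDir)
		if err != nil {
			log.Printf("could not measure temp dir %q: %v", tempDir, err)
			continue
		}
		if used > warnBytes {
			log.Printf("WARNING: gcsfuse temp dir %q is using %d bytes", tempDir, used)
		}
	}
}
```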

@songjiaxun
Collaborator

Thanks for the information @bhack . Yes, we do plan to add more logs and warnings to make the ephemeral storage usage more observable.

Can I know your node type? What compute class or hardware configuration are you using?

On Autopilot, for most of the compute classes, the maximum ephemeral storage you can use is 10Gi, so you can use the annotation gke-gcsfuse/ephemeral-storage-limit: 10Gi to specify it. Please note that the container image and logging also consume ephemeral storage.

@bhack

bhack commented Jan 31, 2024

Please note that the container image and logging also use ephemeral storage.

Is the container image part of the node's ephemeral storage? I don't think it is part of the pod's ephemeral storage request, or is it?

Because if the image is not part of the pod request, and we request 4Gi or 5Gi via gke-gcsfuse/ephemeral-storage-limit and are then evicted because we surpassed that 4Gi or 5Gi limit, it could not have been caused by the image size, right?

Edit:
The hardware configuration in this test was nvidia-tesla-a100 in the 16-GPU config, so it is a2-megagpu-16g.

@songjiaxun
Collaborator

Hi @bhack,

  1. Yes, the container image is part of the node's ephemeral storage; see the documentation Ephemeral storage consumption management.
  2. The ephemeral storage limit calculation is different from CPU or memory: the CPU or memory limit is at the container level, whereas the ephemeral storage limit is at the Pod level. See the documentation How Pods with ephemeral-storage requests are scheduled. This means that even though the ephemeral storage limit is applied to the gcsfuse sidecar container, all the containers in the Pod are subject to that limit.
  3. On Autopilot clusters, the maximum ephemeral storage is 10Gi; see Minimum and maximum resource requests.
  4. Combining all the factors, here is my suggestion:
  • Change the Pod annotation to gke-gcsfuse/ephemeral-storage-limit: 10Gi.
  • Audit your application to see if the container image is too large.
  • Wait for the custom buffer volume support, which should be available very soon.

@bhack

bhack commented Feb 1, 2024

The main problem is still auditing the gcsfuse sidecar vs. the rest of the pod.
If I am running a job for 4-5 hours doing exactly the same things, e.g. a training job/loop, and then the pod is evicted, I need to understand what is happening to the sidecar's ephemeral storage.
Is there a problem with the sidecar accumulating too many files in the temp dir at some point? Is there a bug?
If I don't know the specific ephemeral storage pressure of the sidecar tempdir, it is impossible to investigate.

@songjiaxun
Collaborator

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

@bhack

bhack commented Feb 1, 2024

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

Thanks. I hope we can also add this for CPU and memory later, especially since on Autopilot we cannot set the sidecar resources to "0".

@songjiaxun
Collaborator

Hi @bhack, I wanted to use the same fs metrics collection approach Kubernetes uses, for example the SYS_STATFS system call: https://github.com/kubernetes/kubernetes/blob/dbd3f3564ac6cca9a152a3244ab96257e5a4f00c/pkg/volume/util/fs/fs.go#L40-L63

However, I believe the SYS_STATFS system call does the calculation at the device level. That means that if the buffer volume is an emptyDir, which is the default setting in our case, the returned volume usage/availability is the underlying boot disk's usage/availability.

I am exploring other approaches to calculate just the buffer volume usage.
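
For reference, a minimal sketch of that statfs-based measurement (the path is a placeholder); on an emptyDir-backed path it reports the underlying boot disk's numbers rather than the volume's own usage:

```go
// Minimal sketch of the statfs-based measurement; the path is a placeholder.
// For an emptyDir-backed path this reports the underlying boot disk's
// capacity and availability, not the usage of the volume itself.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// fsUsage returns available, total, and used bytes of the filesystem containing path.
func fsUsage(path string) (available, capacity, used int64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(path, &st); err != nil {
		return 0, 0, 0, err
	}
	available = int64(st.Bavail) * st.Bsize
	capacity = int64(st.Blocks) * st.Bsize
	used = (int64(st.Blocks) - int64(st.Bfree)) * st.Bsize
	return available, capacity, used, nil
}

func main() {
	avail, capacity, used, err := fsUsage("/gcsfuse-tmp") // placeholder mount path
	if err != nil {
		panic(err)
	}
	fmt.Printf("available=%d capacity=%d used=%d\n", avail, capacity, used)
}
```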

@bhack

bhack commented Feb 5, 2024

Aren't they using the same unix.Statfs for emptyDir?
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/emptydir/empty_dir_linux.go#L94-L100

@bhack

bhack commented Feb 5, 2024

What do you think about kubernetes/kubernetes#121489?
