
ephemeral storage for very large tar ~100GB #57

Open
ashish01987 opened this issue Aug 24, 2023 · 30 comments
Labels
enhancement New feature or request

Comments

@ashish01987

ashish01987 commented Aug 24, 2023

I have a local folder (backup) of ~100GB of files. If I directly tar the folder onto the bucket, e.g. tar -cf /tmp/bucketmount/backup.tar /backup/, will there be any issues with the CSI driver? I see that the gcsfuse CSI driver depends on a temp directory (emptyDir: {}) for staging files before they are uploaded to the bucket.

@ashish01987 ashish01987 changed the title from "I have a local folder (backup) of ~100GB files, if i directly tar the folder onto bucket e.g tar -cf /tmp/bucketmount/backup.tar /backup/" to "ephemeral storage for very large tar ~100GB" Aug 24, 2023
@songjiaxun
Collaborator

I think the CSI driver should work in this use case. @ashish01987 do you see any errors?

As you mentioned, gcsfuse uses a temp directory for staging files. As a result, please consider increasing the sidecar container ephemeral-storage-limit so that gcsfuse has enough space to stage the files.

See the GKE documentation for more information.
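
For example, the limit can be raised with a Pod annotation along these lines (a rough sketch: the annotation names come from the GKE docs, but the Pod name and the 50Gi value are placeholders and should be sized for your largest staged file):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backup-pod                              # placeholder name
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/ephemeral-storage-limit: 50Gi   # illustrative value; size it for the largest staged tar
spec:
  # ... containers and the gcsfuse CSI ephemeral volume as usual
```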

@ashish01987
Author

ashish01987 commented Aug 25, 2023

..

@ashish01987
Author

Thanks for the quick response. I created a 30GB tar file "backup.tar" directly on the GCS-mounted bucket (mounted by the CSI sidecar) and did not find any issues with it. Just one question:
While "backup.tar" (30GB) is being created on the mounted bucket, will the CSI sidecar wait for the complete "backup.tar" (30GB) file to be created on ephemeral storage (emptyDir: {}) and then copy it to the actual bucket on Cloud Storage?

If yes, I am a bit concerned about the case where the "backup.tar" size keeps increasing (maybe to 100GB or more due to regular backups) and sufficient node ephemeral storage is not available. In that case, one may have to increase the node's ephemeral storage manually, which might cause downtime for the cluster (probably?).

I see that the CSI sidecar uses the "gke-gcsfuse-tmp" mount point, backed by emptyDir: {}, for staging files before uploading:

  - emptyDir: {}
    name: gke-gcsfuse-tmp

It would be great if allocating storage from a regular persistent disk (or an NFS share) were supported here for gke-gcsfuse-tmp. That way we could allocate any amount of storage without changing the node's ephemeral storage (and avoid cluster downtime).

I tried something like this:

volumes:
  - name: gke-gcsfuse-tmp
    persistentVolumeClaim:
      claimName: common-backup

where "common-backup" allocates storage from a persistent disk or an NFS share based on the storage class.

However, it did not work: the deployment did not start and was not able to find the CSI sidecar. Probably some validation is in place to check that "gke-gcsfuse-tmp" uses emptyDir: {} only?

Maybe supporting both emptyDir: {} and allocating storage from a PVC as above for "gke-gcsfuse-tmp" would be beneficial (if the implementation is feasible).

@songjiaxun Let me know your thoughts on this

@ashish01987
Author

@songjiaxun any thoughts on this?

@songjiaxun
Collaborator

Hi @ashish01987 , thanks for testing out the staging file system.

To answer your question: yes, in the current design the volume gke-gcsfuse-tmp has to be an emptyDir; see the validation logic code.

The GCS FUSE team is working on a write-through feature, which means the staging volume may not be needed in a future release. @sethiay and @Tulsishah, could you share more information about the write-through feature? And will it support this "tar file" use case?

Meanwhile, @judemars FYI as you may need to add a new volume to the sidecar container for the read caching feature.

@sethiay
Contributor

sethiay commented Aug 30, 2023

Thanks @songjiaxun for looping us in. Currently, we are evaluating support for a write-through feature in GCSFuse, i.e. allowing users to write directly to GCS without buffering on local disk. Given that tar works now with GCSFuse, we expect it to work with the write-through feature as well.

@ashish01987
Author

What is the expected timeline for the write-through feature?

@sethiay
Contributor

sethiay commented Sep 4, 2023

@ashish01987 Currently, we don't have any timelines to share.

@ashish01987
Author

ashish01987 commented Sep 4, 2023

@songjiaxun Since we don't know the timeline for the write-through feature, as a workaround can we disable this validation logic code and support allocating storage from any PVC for gke-gcsfuse-tmp?

That is, the storage could be allocated from a persistent disk instead of the node's ephemeral storage.

That way, customers using the gcsfuse CSI driver would never face issues like "insufficient ephemeral storage".

I'm not sure, but such issues can arise in clusters where multiple pods each have their own gcsfuse CSI sidecar instance.

@songjiaxun
Collaborator

@ashish01987 , thanks for the suggestion.

As more and more customers report "insufficient ephemeral storage" issues, we are exploring the possibility of allowing users to use volumes backed by other media, rather than emptyDir, for write staging.

FYI @judemars .

@ashish01987
Author

ashish01987 commented Sep 22, 2023

For the time being, is it possible to make this validation

if v.Name == SidecarContainerVolumeName && v.VolumeSource.EmptyDir != nil {

optional? That way I could define the sidecar with gke-gcsfuse-tmp pointing to a PVC:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-test
spec:
  serviceAccountName: gcs-csi
  containers:
  - name: busybox
    image: busybox
    resources:
      limits:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
      requests:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
    command:
    - "/bin/sh"
    - "-c"
    - sleep infinite
    volumeMounts:
    - name: gcs-fuse-csi-ephemeral
      mountPath: /data
  - name: gke-gcsfuse-sidecar
    image: gke.gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v0.1.4-gke.1@sha256:442969f1e565ba63ff22837ce7a530b6cbdb26330140b7f9e1dc23f53f1df335
    imagePullPolicy: IfNotPresent
    args:
    - --v=5
    resources:
      limits:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
      requests:
        cpu: 250m
        ephemeral-storage: 1Gi
        memory: 256Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsGroup: 65534
      runAsNonRoot: true
      runAsUser: 65534
      seccompProfile:
        type: RuntimeDefault
    volumeMounts:
    - mountPath: /gcsfuse-tmp
      name: gke-gcsfuse-tmp
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: # unique bucket name
  - name: gke-gcsfuse-tmp
    persistentVolumeClaim:
      claimName: my-pvc-backup
```

@ashish01987
Author

I see that writing very large files, like a 70GB .tar file, will fail if that much ephemeral storage is not available on the node.

@songjiaxun
Collaborator

@ashish01987 , thanks for the suggestion and for reporting the issue. I am actively working on skipping the validation and will keep you posted.

@ashish01987
Author

Thanks for looking into this. Maybe it would be great if the "claimName: my-pvc-backup" for gke-gcsfuse-tmp could be passed as a parameter through an annotation on the pod.

@songjiaxun
Collaborator

I am working on a new feature to allow you to specify a separate volume for the write buffering. I will keep you posted.

@songjiaxun songjiaxun added the enhancement (New feature or request) label Jan 11, 2024
@bhack

bhack commented Jan 25, 2024

To avoid cross-posting: I think that in the meantime we could still better notify the user about the sidecar-specific tempdir pressure/occupancy.
See more at #21 (comment)

@bhack

bhack commented Jan 30, 2024

@songjiaxun In the meantime, is there a temporary workaround to monitor ephemeral storage occupancy with a kubectl exec command on the gcsfuse sidecar container?
I have a pod being evicted, presumably for exceeding ephemeral storage, but I cannot confirm that, and I want to investigate/monitor the gcsfuse ephemeral disk occupancy.

@songjiaxun
Collaborator

Hi @bhack, because the gcsfuse sidecar container is a distroless container, you cannot run any shell commands in it using kubectl exec.

We are rolling out the feature to support custom volumes for write buffering. The new feature should be available soon.

Meanwhile, if you are experiencing ephemeral storage limit issues, consider setting the pod annotation gke-gcsfuse/ephemeral-storage-limit: "0". It will unset any ephemeral storage limit on the sidecar container.
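
For example, just the relevant annotation (a sketch):

```yaml
metadata:
  annotations:
    gke-gcsfuse/ephemeral-storage-limit: "0"   # unsets the ephemeral storage limit on the sidecar container
```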

@bhack

bhack commented Jan 31, 2024

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

@songjiaxun
Collaborator

gke-gcsfuse/ephemeral-storage-limit: "0"

Is this ok in Autopilot or will it be rejected?

Oh sorry, I forgot the context of Autopilot. Unfortunately, no, gke-gcsfuse/ephemeral-storage-limit: "0" only works on Standard clusters.

Is your application writing large files back to the bucket?

@bhack

bhack commented Jan 31, 2024

Is your application writing large files back to the bucket?

Not so large.
It is a classical ML workload: regular checkpoints + TB logs.
Everything goes OK, but at some point, after many K-steps, the pod starts to be regularly evicted for ephemeral storage, even after restarting from the last checkpoint (e.g. I've also tested this with a restarting job and spot instances).

The main pod is quite complex, but it seems it is not writing anything other than to the gcsfuse CSI-mounted volumes.
But with the current tools it is hard to debug.

That is why we need something to monitor the sidecar's pressure/occupancy on ephemeral storage (and likely on CPU and memory).

This is both to debug when things fail and to keep a residual margin in resource planning.
The latter is also important on Autopilot, as we cannot just set the limits to 0.

Just to focus on the ephemeral storage point: in the sidecar log I see
sidecar_mounter.go:86] gcsfuse mounting with args [gcsfuse --temp-dir <volume-name-temp-dir>

Can you monitor that temp-dir occupancy in Go so that we can start to have some warnings in the logs?
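
Even something as simple as periodically walking the temp dir and logging a warning above a threshold would help; a rough sketch (not driver code; the path and threshold are placeholders):

```go
// Rough sketch of a temp-dir usage watcher; the path and threshold are placeholders,
// not actual gcsfuse or CSI driver configuration.
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// dirUsageBytes sums the apparent size of all regular files under dir.
func dirUsageBytes(dir string) (int64, error) {
	var total int64
	err := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	const tempDir = "/gcsfuse-tmp/.volumes" // placeholder temp-dir path
	const warnBytes = 8 << 30               // placeholder threshold: warn above 8 GiB

	for range time.Tick(30 * time.Second) {
		used, err := dirUsageBytes(tempDir)
		if err != nil {
			log.Printf("could not measure temp dir %q: %v", tempDir, err)
			continue
		}
		if used > warnBytes {
			log.Printf("WARNING: gcsfuse temp dir %q is using %d bytes", tempDir, used)
		}
	}
}
```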

@songjiaxun
Collaborator

Thanks for the information @bhack . Yes, we do plan to add more logs and warnings to make the ephemeral storage usage more observable.

Can I know your node type? What compute class or hardware configuration are you using?

On Autopilot, for most of the compute classes, the maximum ephemeral storage you can use is 10Gi, so you can use the annotation gke-gcsfuse/ephemeral-storage-limit: 10Gi to specify it. Please note that the container image and logging also consume ephemeral storage.

@bhack

bhack commented Jan 31, 2024

Please note that the container image and logging also use ephemeral storage.

Is the container image part of the node's ephemeral storage? I don't think it is part of the pod's ephemeral storage request, or is it?

Because if the image is not part of the pod request, and we request 4Gi or 5Gi via gke-gcsfuse/ephemeral-storage-limit and are then evicted because we surpassed that 4Gi or 5Gi limit, it could not have been caused by the image size, right?

Edit:
The hardware configuration in this test was nvidia-tesla-a100 in the 16-GPU config, so it is a2-megagpu-16g.

@songjiaxun
Collaborator

Hi @bhack,

  1. Yes, the container image is part of the node's ephemeral storage; see the documentation Ephemeral storage consumption management.
  2. The ephemeral storage limit calculation is different from CPU or memory: the CPU or memory limit is at the container level, whereas the ephemeral storage limit is at the Pod level. See the documentation How Pods with ephemeral-storage requests are scheduled. This means that even though the ephemeral storage limit is applied to the gcsfuse sidecar container, all the containers in the Pod are subject to that limit.
  3. On Autopilot clusters, the maximum ephemeral storage is 10Gi; see Minimum and maximum resource requests.
  4. Combining all the factors, here is my suggestion:
  • Change the Pod annotation to gke-gcsfuse/ephemeral-storage-limit: 10Gi.
  • Audit your application to see if the container image is too large.
  • Wait for the custom buffer volume support, which should be available very soon.

@bhack

bhack commented Feb 1, 2024

The main problem is still auditing the gcsfuse sidecar vs. the rest of the pod.
If I am running a job for 4-5 hours doing exactly the same things, e.g. a training job/loop, and then the pod is evicted, I need to understand what is happening to the sidecar's ephemeral storage.
Is there a problem with the sidecar accumulating too many files in the temp dir at some point? Is there a bug?
If I don't know the specific ephemeral storage pressure of the sidecar tempdir, it is impossible to investigate.

@songjiaxun
Collaborator

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

@bhack

bhack commented Feb 1, 2024

@bhack, yes, it makes sense. I will let you know when the warning logs are ready.

Thanks. I hope we can also add this for CPU and memory later, especially since on Autopilot we cannot set the sidecar resources to "0".

@songjiaxun
Collaborator

Hi @bhack, I wanted to use the same fs metrics collection approach Kubernetes uses, for example the SYS_STATFS system call: https://github.com/kubernetes/kubernetes/blob/dbd3f3564ac6cca9a152a3244ab96257e5a4f00c/pkg/volume/util/fs/fs.go#L40-L63

However, I believe the SYS_STATFS system call does the calculation at the device level. That means that if the buffer volume is an emptyDir, which is the default setting in our case, the returned volume usage/availability is the underlying boot disk's usage/availability.

I am exploring other approaches to calculate just the buffer volume usage.
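
For reference, a minimal sketch of that statfs-based measurement (the path is a placeholder); on an emptyDir-backed path it reports the underlying boot disk's numbers rather than the volume's own usage:

```go
// Minimal sketch of the statfs-based measurement; the path is a placeholder.
// For an emptyDir-backed path this reports the underlying boot disk's
// capacity and availability, not the usage of the volume itself.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// fsUsage returns available, total, and used bytes of the filesystem containing path.
func fsUsage(path string) (available, capacity, used int64, err error) {
	var st unix.Statfs_t
	if err = unix.Statfs(path, &st); err != nil {
		return 0, 0, 0, err
	}
	available = int64(st.Bavail) * st.Bsize
	capacity = int64(st.Blocks) * st.Bsize
	used = (int64(st.Blocks) - int64(st.Bfree)) * st.Bsize
	return available, capacity, used, nil
}

func main() {
	avail, capacity, used, err := fsUsage("/gcsfuse-tmp") // placeholder mount path
	if err != nil {
		panic(err)
	}
	fmt.Printf("available=%d capacity=%d used=%d\n", avail, capacity, used)
}
```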

@bhack

bhack commented Feb 5, 2024

Aren't they using the same unix.Statfs for emptyDir?
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/emptydir/empty_dir_linux.go#L94-L100

@bhack

bhack commented Feb 5, 2024

What do you think about kubernetes/kubernetes#121489?
