
vgpu not restricting memory in the container #3384

Open
kunal642 opened this issue Apr 3, 2024 · 25 comments
Labels: kind/bug Categorizes issue or PR as related to a bug.
@kunal642

kunal642 commented Apr 3, 2024

What happened:

When running the vgpu example provided in the docs with a vgpu memory limit set, the container does not respect the limit: the nvidia-smi command still reports the full 32 GB of the V100.

What you expected to happen:

The memory inside the container should be limited to the vgpu-memory configuration.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
Nvidia-smi version: 545.23.08
MIG M: NA
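For reference, a hedged sketch of how the reported memory can be checked from outside the pod (the pod name `gpu-pod12` is taken from the yaml later in this thread; adjust names for your cluster):

```shell
# Query the total memory nvidia-smi reports inside the pod (assumed names):
#   kubectl exec gpu-pod12 -- nvidia-smi --query-gpu=memory.total --format=csv,noheader
#
# Small helper to pull the numeric MiB value out of a line like "32768 MiB",
# so it can be compared against the requested volcano.sh/vgpu-memory:
reported_mib() {
  echo "$1" | awk '{print $1}'
}

reported_mib "32768 MiB"   # prints 32768
```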

Environment:

  • Volcano Version: 1.8.x
  • Kubernetes version (use kubectl version): v1.28.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kunal642 kunal642 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 3, 2024
@kunal642 kunal642 changed the title Using volcano vgpu not restricting memory in the container vgpu not restricting memory in the container Apr 3, 2024
@lowang-bh (Member)

/assign @archlitchi

@kunal642 (Author)

Hey @archlitchi, can you suggest something for this?

@archlitchi (Contributor)

archlitchi commented Apr 10, 2024

> Hey @archlitchi, Can you suggest something for this?

could you provide the following information:

  1. The vgpu-task yaml you submitted?
  2. "env" result inside container

@kunal642 (Author)

kunal642 commented Apr 10, 2024

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 200
  nodeSelector: ...
  tolerations: ...
EOF

The nodeSelector and tolerations are private, so I can't show them here. Let me know if these properties can also affect the behavior of vgpu.

@archlitchi (Contributor)

> cat <<EOF | kubectl apply -f -
> apiVersion: v1
> kind: Pod
> metadata:
>   name: gpu-pod12
> spec:
>   schedulerName: volcano
>   containers:
>     - name: ubuntu-container
>       image: ubuntu:18.04
>       command: ["bash", "-c", "sleep 86400"]
>       resources:
>         limits:
>           volcano.sh/vgpu-number: 1
>           volcano.sh/vgpu-memory: 3000
>   nodeSelector: ...
>   tolerations: ...
> EOF
>
> NodeSelector and tolerations are private, therefore can't show them here. Let me know if these properties can also affect the behavior of vgpu

Could you provide the 'env' result inside the container?

@kunal642 (Author)

Won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.

@archlitchi (Contributor)

Okay, please list the env entries that contain the keyword 'CUDA' or 'NVIDIA'.
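A minimal sketch of that check (the kubectl line assumes the pod/container names from the yaml above):

```shell
# Run inside the container, e.g.:
#   kubectl exec gpu-pod12 -c ubuntu-container -- env | grep -E 'CUDA|NVIDIA'
#
# The same filter as a reusable function, applied to any env dump on stdin:
filter_gpu_env() {
  grep -E 'CUDA|NVIDIA'
}

printf 'CUDA_VERSION=11.8.0\nPATH=/usr/bin\n' | filter_gpu_env   # prints CUDA_VERSION=11.8.0
```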

@kunal642 (Author)

kunal642 commented Apr 10, 2024

Did not print the output of NVIDIA_REQUIRE_CUDA because it's too long to type. Please bear with me.

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

@archlitchi (Contributor)

> Did not print output of NVIDIA_REQUIRE_CUDA because its too long to type. Please bear with me
>
> NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
> NVIDIA_DRIVER_CAPABILITIES=compute,utility
> NVIDIA_PRODUCT_NAME=CUDA
> NV_CUDA_CUDART_VERSION=11.8.89-1
> CUDA_VERSION=11.8.0
> NVCUDA_LIB_VERSION=11.8.0-1
> CUDA_DEVICE_MEMORY_LIMIT_0=200m
> CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

Em... is this container the one in the yaml file? You allocated 3G in your yaml, but here it only gets 200M. Besides, this is probably a CUDA image, not a typical ubuntu:18.04.

@kunal642 (Author)

kunal642 commented Apr 10, 2024

Sorry, I ran a different yaml; everything else is the same except the memory is 200m. Updated the earlier comment as well.

@archlitchi (Contributor)

> sorry, i ran a different yaml, everything else is same except memory is 200m, updated the earlier comment as well

Please check that the following files exist inside the container, AND that the size of each file is NOT 0:

  1. /usr/local/vgpu/libvgpu.so
  2. /etc/ld.so.preload
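The two checks above can be scripted as follows (a sketch; run it inside the container):

```shell
# [ -s FILE ] succeeds only when FILE exists and has size > 0.
check_nonempty() {
  if [ -s "$1" ]; then
    echo "$1: OK (exists, non-empty)"
  else
    echo "$1: MISSING or empty"
  fi
}

check_nonempty /usr/local/vgpu/libvgpu.so
check_nonempty /etc/ld.so.preload
```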

@kunal642 (Author)

/usr/local/vgpu/libvgpu.so -> exists with non-zero size
/etc/ld.so.preload -> does not exist

@archlitchi (Contributor)

> /usr/local/vgpu/libvgpu.so -> exists with non-zero size
> /etc/ld.so.preload -> does not exist

Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.
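For example, the image line in the DaemonSet spec of volcano-vgpu-device-plugin.yml would change to something like this (the container name here is an assumption; match it to your manifest):

```yaml
      containers:
        - name: volcano-vgpu-device-plugin
          image: volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219
```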

@kunal642 (Author)

Okay, let me try this!

@kunal642 (Author)

kunal642 commented Apr 16, 2024

Hey @archlitchi, the mentioned error is on the same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?

@kunal642 (Author)

Hey @archlitchi, any other suggestions to fix this?

@EswarS

EswarS commented Apr 29, 2024

Hi @archlitchi, I am also facing the same issue with the Volcano vGPU feature. Could you guide me on enabling this feature? Thanks in advance.

@archlitchi (Contributor)

archlitchi commented Apr 30, 2024

> Hi @archlitchi , i am also facing same issue with volcano vGPU feature. Could you guide me enable this feature. Thanks in advance.

@kunal642

OK, I'm looking into it now. Sorry I didn't see your replies the last two weeks.

@archlitchi (Contributor)

@EswarS @kunal642 please use the image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead; the volcano image will no longer provide hard device isolation among containers due to community policies.

@kunal642 (Author)

kunal642 commented May 2, 2024

@archlitchi is the usage the same for the vgpu-memory and vgpu-number configurations?

@EswarS

EswarS commented May 3, 2024

> @EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with the volcano 1.8.2 release package?

I deployed the device plugin and am facing the following errors:

Initializing ...
Fail to open shrreg ***.cache (errorno:11)
Fail to init shrreg ****.cache (errorno:9)
Fail to write shrreg ***.cache (errorno:9)
Fail to reseek shrreg ***.cache (errorno:9)
Fail to lock shrreg ***.cache (errorno:9)

@archlitchi (Contributor)

> @archlitchi is the usage same for vgpu-memory and vgpu-number configurations?

Yes. Can you run your task now?

@archlitchi (Contributor)

> @EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.
>
> Is this device plugin compatible with volcano 1.8.2 release package.
>
> I deployed the device plugin. Facing following error:
> Initializing ...
> Fail to open shrreg ***.cache (errorno:11)
> Fail to init shrreg ****.cache (errorno:9)
> Fail to write shrreg ***.cache (errorno:9)
> Fail to reseek shrreg ***.cache (errorno:9)
> Fail to lock shrreg ***.cache (errorno:9)

The vgpu-device-plugin mounts the hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu"; please check whether the corresponding hostPath exists.

@EswarS

EswarS commented May 8, 2024

volumeMounts:
  - mountPath: /var/lib/kubelet/device-plugins
    name: device-plugin
  - mountPath: /usr/local/vgpu
    name: lib
  - mountPath: /tmp
    name: hosttmp

The above are the volumes configured in the device-plugin daemon. Do I need to make any changes?

@archlitchi (Contributor)

@EswarS No, I mean: after you submit a vgpu task to Volcano, please check

  1. Does the corresponding folder "/tmp/vgpu/containers/{containerUID}_{ctrName}" exist on your corresponding GPU node?
  2. Does the folder "/tmp/vgpu" exist inside the vgpu-task container?
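A sketch of both checks ({containerUID}_{ctrName} stays as a placeholder; the real values come from your running pod):

```shell
# On the GPU node (substitute your pod's values for the placeholders):
#   ls -ld /tmp/vgpu/containers/<containerUID>_<ctrName>
#
# Inside the vgpu-task container (pod name assumed from the yaml above):
#   kubectl exec gpu-pod12 -- ls -ld /tmp/vgpu
#
# A tiny predicate for scripting the same check:
dir_exists() {
  [ -d "$1" ]
}

dir_exists /tmp && echo "/tmp exists"
```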
