
vgpu not restricting memory in the container #3384

Open
kunal642 opened this issue Apr 3, 2024 · 25 comments
Labels: kind/bug Categorizes issue or PR as related to a bug.
@kunal642

kunal642 commented Apr 3, 2024

What happened:

When running the vgpu example provided in the docs with a vgpu memory limit set, the container does not respect the limit: the nvidia-smi command still reports the full 32 GB of the V100.

What you expected to happen:

The memory inside the container should be limited to the vgpu-memory configuration.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
Nvidia-smi version: 545.23.08
MIG M: NA
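For reference, a hedged sketch of how the reported memory can be checked from outside the pod (the pod name `gpu-pod12` is taken from the yaml later in this thread; adjust names for your cluster):

```shell
# Query the total memory nvidia-smi reports inside the pod (assumed names):
#   kubectl exec gpu-pod12 -- nvidia-smi --query-gpu=memory.total --format=csv,noheader
#
# Small helper to pull the numeric MiB value out of a line like "32768 MiB",
# so it can be compared against the requested volcano.sh/vgpu-memory:
reported_mib() {
  echo "$1" | awk '{print $1}'
}

reported_mib "32768 MiB"   # prints 32768
```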

Environment:

  • Volcano Version: 1.8.x
  • Kubernetes version (use kubectl version): v1.28.x
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kunal642 kunal642 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 3, 2024
@kunal642 kunal642 changed the title Using volcano vgpu not restricting memory in the container vgpu not restricting memory in the container Apr 3, 2024
@lowang-bh (Member)

/assign @archlitchi

@kunal642 (Author)

Hey @archlitchi, can you suggest something for this?

@archlitchi (Contributor)

archlitchi commented Apr 10, 2024

> Hey @archlitchi, Can you suggest something for this?

could you provide the following information:

  1. The vgpu-task yaml you submitted?
  2. "env" result inside container

@kunal642 (Author)

kunal642 commented Apr 10, 2024

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 200
  nodeSelector: ...
  tolerations: ...
EOF

The nodeSelector and tolerations are private, so I can't show them here. Let me know if these properties can also affect the behavior of vgpu.

@archlitchi (Contributor)

> cat <<EOF | kubectl apply -f -
> apiVersion: v1
> kind: Pod
> metadata:
>   name: gpu-pod12
> spec:
>   schedulerName: volcano
>   containers:
>     - name: ubuntu-container
>       image: ubuntu:18.04
>       command: ["bash", "-c", "sleep 86400"]
>       resources:
>         limits:
>           volcano.sh/vgpu-number: 1
>           volcano.sh/vgpu-memory: 3000
>   nodeSelector: ...
>   tolerations: ...
> EOF
>
> NodeSelector and tolerations are private, therefore can't show them here. Let me know if these properties can also affect the behavior of vgpu

Could you provide the 'env' result inside the container?

@kunal642 (Author)

Won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.

@archlitchi (Contributor)

Okay, please list the env entries that contain the keyword 'CUDA' or 'NVIDIA'.
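A minimal sketch of that check (the kubectl line assumes the pod/container names from the yaml above):

```shell
# Run inside the container, e.g.:
#   kubectl exec gpu-pod12 -c ubuntu-container -- env | grep -E 'CUDA|NVIDIA'
#
# The same filter as a reusable function, applied to any env dump on stdin:
filter_gpu_env() {
  grep -E 'CUDA|NVIDIA'
}

printf 'CUDA_VERSION=11.8.0\nPATH=/usr/bin\n' | filter_gpu_env   # prints CUDA_VERSION=11.8.0
```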

@kunal642 (Author)

kunal642 commented Apr 10, 2024

Did not print the output of NVIDIA_REQUIRE_CUDA because it's too long to type. Please bear with me.

NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

@archlitchi (Contributor)

> Did not print output of NVIDIA_REQUIRE_CUDA because its too long to type. Please bear with me
>
> NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
> NVIDIA_DRIVER_CAPABILITIES=compute,utility
> NVIDIA_PRODUCT_NAME=CUDA
> NV_CUDA_CUDART_VERSION=11.8.89-1
> CUDA_VERSION=11.8.0
> NVCUDA_LIB_VERSION=11.8.0-1
> CUDA_DEVICE_MEMORY_LIMIT_0=200m
> CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache

Em... is this container the one in the yaml file? You allocated 3G in your yaml, but here it only gets 200M. Besides, this is probably a CUDA image, not a typical ubuntu:18.04.

@kunal642 (Author)

kunal642 commented Apr 10, 2024

Sorry, I ran a different yaml; everything else is the same except the memory is 200m. Updated the earlier comment as well.

@archlitchi (Contributor)

> sorry, i ran a different yaml, everything else is same except memory is 200m, updated the earlier comment as well

Please check that the following files exist inside the container, AND that the size of each file is NOT 0:

  1. /usr/local/vgpu/libvgpu.so
  2. /etc/ld.so.preload
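The two checks above can be scripted as follows (a sketch; run it inside the container):

```shell
# [ -s FILE ] succeeds only when FILE exists and has size > 0.
check_nonempty() {
  if [ -s "$1" ]; then
    echo "$1: OK (exists, non-empty)"
  else
    echo "$1: MISSING or empty"
  fi
}

check_nonempty /usr/local/vgpu/libvgpu.so
check_nonempty /etc/ld.so.preload
```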

@kunal642 (Author)

/usr/local/vgpu/libvgpu.so -> exists with non-zero size
/etc/ld.so.preload -> does not exist

@archlitchi (Contributor)

> /usr/local/vgpu/libvgpu.so -> exists with non-zero size
> /etc/ld.so.preload -> does not exist

Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.
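For example, the image line in the DaemonSet spec of volcano-vgpu-device-plugin.yml would change to something like this (the container name here is an assumption; match it to your manifest):

```yaml
      containers:
        - name: volcano-vgpu-device-plugin
          image: volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219
```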

@kunal642 (Author)

Okay, let me try this!

@kunal642 (Author)

kunal642 commented Apr 16, 2024

Hey @archlitchi, the mentioned error is on the same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?

@kunal642 (Author)

Hey @archlitchi, any other suggestions to fix this?

@EswarS

EswarS commented Apr 29, 2024

Hi @archlitchi, I am also facing the same issue with the Volcano vGPU feature. Could you guide me on enabling this feature? Thanks in advance.

@archlitchi (Contributor)

archlitchi commented Apr 30, 2024

> Hi @archlitchi , i am also facing same issue with volcano vGPU feature. Could you guide me enable this feature. Thanks in advance.

@kunal642

OK, I'm looking into it now. Sorry I didn't see your replies the last two weeks.

@archlitchi (Contributor)

@EswarS @kunal642 please use the image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead; the volcano image will no longer provide hard device isolation among containers due to community policies.

@kunal642 (Author)

kunal642 commented May 2, 2024

@archlitchi is the usage the same for the vgpu-memory and vgpu-number configurations?

@EswarS

EswarS commented May 3, 2024

> @EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.

Is this device plugin compatible with the volcano 1.8.2 release package?

I deployed the device plugin and am facing the following errors:

Initializing ...
Fail to open shrreg ***.cache (errorno:11)
Fail to init shrreg ****.cache (errorno:9)
Fail to write shrreg ***.cache (errorno:9)
Fail to reseek shrreg ***.cache (errorno:9)
Fail to lock shrreg ***.cache (errorno:9)

@archlitchi (Contributor)

> @archlitchi is the usage same for vgpu-memory and vgpu-number configurations?

Yes. Can you run your task now?

@archlitchi (Contributor)

> @EswarS @kunal642 please use image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead, volcano image will no longer provide hard device isolation among containers due to community policies.
>
> Is this device plugin compatible with volcano 1.8.2 release package.
>
> I deployed the device plugin. Facing following error:
> Initializing ...
> Fail to open shrreg ***.cache (errorno:11)
> Fail to init shrreg ****.cache (errorno:9)
> Fail to write shrreg ***.cache (errorno:9)
> Fail to reseek shrreg ***.cache (errorno:9)
> Fail to lock shrreg ***.cache (errorno:9)

The vgpu-device-plugin mounts the hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu"; please check whether the corresponding hostPath exists.

@EswarS

EswarS commented May 8, 2024

volumeMounts:
  - mountPath: /var/lib/kubelet/device-plugins
    name: device-plugin
  - mountPath: /usr/local/vgpu
    name: lib
  - mountPath: /tmp
    name: hosttmp

The above are the volumes configured in the device-plugin daemon. Do I need to make any changes?

@archlitchi (Contributor)

@EswarS No, I mean: after you submit a vgpu task to Volcano, please check

  1. Does the corresponding folder "/tmp/vgpu/containers/{containerUID}_{ctrName}" exist on your corresponding GPU node?
  2. Does the folder "/tmp/vgpu" exist inside the vgpu-task container?
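A sketch of both checks ({containerUID}_{ctrName} stays as a placeholder; the real values come from your running pod):

```shell
# On the GPU node (substitute your pod's values for the placeholders):
#   ls -ld /tmp/vgpu/containers/<containerUID>_<ctrName>
#
# Inside the vgpu-task container (pod name assumed from the yaml above):
#   kubectl exec gpu-pod12 -- ls -ld /tmp/vgpu
#
# A tiny predicate for scripting the same check:
dir_exists() {
  [ -d "$1" ]
}

dir_exists /tmp && echo "/tmp exists"
```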
