Enable GPU access with DeviceRequests #7929

Merged: 1 commit into docker:master on Nov 17, 2020

Conversation

aiordache (Contributor) commented:

Convert the compose-spec devices mapping to a DeviceRequest to enable GPU access for containers.
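
For orientation, here is a minimal sketch of what the conversion amounts to, using docker-py's DeviceRequest type. The helper name and dict handling are illustrative, not the exact code merged in this PR:

from docker.types import DeviceRequest

def device_request_from_spec(device):
    # 'device' is one entry from deploy.resources.reservations.devices,
    # e.g. {'driver': 'nvidia', 'count': 1, 'capabilities': ['gpu', 'utility']}
    return DeviceRequest(
        driver=device.get('driver'),
        count=device.get('count'),
        device_ids=device.get('device_ids'),
        # The Engine API expects a list of capability sets (an OR of
        # AND-lists), so the single compose-level list is wrapped once.
        capabilities=[device.get('capabilities', [])],
    )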

Tested on a GPU host:

$ cat /etc/docker/daemon.json 
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Sample compose files:

services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    runtime: nvidia

or

services:
  test:
    image: nvidia/cuda
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - 'driver': 'nvidia'
            'count': 1
            'capabilities': ['gpu', 'utility']
$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1
test_1  | Fri Nov 13 20:46:11 2020       
test_1  | +-----------------------------------------------------------------------------+
test_1  | | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
test_1  | |-------------------------------+----------------------+----------------------+
test_1  | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
test_1  | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
test_1  | |                               |                      |               MIG M. |
test_1  | |===============================+======================+======================|
test_1  | |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
test_1  | | N/A   23C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
test_1  | |                               |                      |                  N/A |
test_1  | +-------------------------------+----------------------+----------------------+
test_1  |                                                                                
test_1  | +-----------------------------------------------------------------------------+
test_1  | | Processes:                                                                  |
test_1  | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
test_1  | |        ID   ID                                                   Usage      |
test_1  | |=============================================================================|
test_1  | |  No running processes found                                                 |
test_1  | +-----------------------------------------------------------------------------+
gpu_test_1 exited with code 0

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;print(tf.test.gpu_device_name())"
    deploy:
      resources:
        reservations:
          devices:
          - 'driver': 'nvidia'
            'capabilities': ['gpu']
$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1
test_1  | 2020-11-13 20:49:54.444634: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
.....
test_1  | 2020-11-13 20:49:56.048674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
test_1  | /device:GPU:0
gpu_test_1 exited with code 0

Tested on a multi-GPU host:

$ nvidia-smi 
Fri Nov 13 20:57:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   72C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   67C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   74C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   62C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Enable access only to the GPU-0 and GPU-3 devices:

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;print(tf.test.gpu_device_name())"
    deploy:
      resources:
        reservations:
          devices:
          - 'driver': 'nvidia'
            'device_ids': ['0','3']
            'capabilities': ['gpu']
$ docker-compose up
...
test_1  | 2020-11-13 21:02:52.076151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1b.0, compute capability: 7.5)
test_1  | 2020-11-13 21:02:52.076752: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
test_1  | 2020-11-13 21:02:52.077844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:1 with 13970 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
test_1  | /device:GPU:0
gpu_test_1 exited with code 0
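
For comparison, the same device selection can be reproduced directly with the Docker SDK for Python (docker-py 4.3+); this is an illustration, not code from this PR:

import docker
from docker.types import DeviceRequest

client = docker.from_env()
output = client.containers.run(
    "tensorflow/tensorflow:latest-gpu",
    'python -c "import tensorflow as tf;print(tf.test.gpu_device_name())"',
    # Expose only GPU-0 and GPU-3, mirroring device_ids above.
    device_requests=[DeviceRequest(driver="nvidia",
                                   device_ids=["0", "3"],
                                   capabilities=[["gpu"]])],
    remove=True,
)
print(output.decode())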

Requires compose-spec/compose-spec#109
Closes #6691

@opptimus commented:

Wonderful and significant feature; looking forward to it landing in a release build!

@chris-crone (Member) left a review:

One minor change but LGTM

@@ -179,6 +180,7 @@ def __init__(
         ipc_mode=None,
         pid_mode=None,
         default_platform=None,
+        device_requests=None,
@chris-crone (Member) commented on the diff:

Careful of changing function parameter ordering. It's possible that users are calling this function with positional parameters, so it's best to add new parameters at the end of the list (i.e., after extra_labels).
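
A hypothetical, trimmed signature to illustrate the failure mode (names are for illustration only):

class Service:
    def __init__(self, name, pid_mode=None, default_platform=None,
                 extra_labels=None, device_requests=None):
        # New parameters go last, so an existing positional call such as
        # Service("web", "host", "linux/amd64") still binds the platform
        # string to default_platform as intended.
        self.device_requests = device_requests or []

# Had device_requests been inserted before default_platform instead, the
# same call would silently pass "linux/amd64" as device_requests.
svc = Service("web", "host", "linux/amd64")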

Signed-off-by: aiordache <anca.iordache@docker.com>
@ndeloof merged commit 854c003 into docker:master on Nov 17, 2020
@aiordache added this to the 1.28.0 milestone on Dec 7, 2020
facebook-github-bot pushed a commit to facebookresearch/detectron2 that referenced this pull request on Feb 9, 2021:
Summary:
Updating GPU access from docker-compose, as pointed out by this comment:

https://github.com/facebookresearch/detectron2/blob/45a8bfb64053d71d9d7f136fb25a6abe841dc91f/docker/docker-compose.yml#L9

The solution comes from this [pull request](docker/compose#6691) and has been working since the 1.28.0 [release](https://github.com/docker/compose/releases). It is the [official](docker/compose#7929) replacement for `runtime: nvidia`.

This way, we don't need to install nvidia-docker (fewer prerequisites 🎉), but nvidia-container-toolkit still seems to be required.

Pull Request resolved: #2584

Reviewed By: theschnitz

Differential Revision: D26318490

Pulled By: ppwwyyxx

fbshipit-source-id: f732a8d05dbd42cd72d228719507ac45caa86ea4

Successfully merging this pull request may close these issues.

Support for NVIDIA GPUs under Docker Compose