
nvproxy: unknown control command 0x3d05 #10413

Open
thundergolfer opened this issue May 9, 2024 · 6 comments
Labels
area: gpu (Issue related to sandboxed GPU access), type: bug (Something isn't working)

Comments

@thundergolfer
Contributor

thundergolfer commented May 9, 2024

Description

We are doing multi-GPU training on A100s and seeing that it gets stuck under gVisor. I tried the program below on the following GPUs within Modal:

  • A100 40 GiB (Oracle Cloud) ❌
  • H100 (a3-highgpu-8g) ❌
  • A10G ✔️
  • T4 ✔️

Both the H100 and A100 run into this unknown control command:

```

W0509 01:16:28.218428  1772489 frontend.go:521] [   6:  20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780  1772489 frontend.go:521] [   5:  22] nvproxy: unknown control command 0x3d05 (paramsSize=24)

```

That command is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
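
For reference, here is a minimal ctypes sketch of that command's parameter struct, mirroring the linked header (the field names are transcribed from memory of that header, so double-check it there); the point is that its size works out to the paramsSize=24 reported in the warnings above:

```python
import ctypes

NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD = 0x3d05  # value from the linked header

class ExportObjectRmObject(ctypes.Structure):
    # rmObject variant of the NV0000_CTRL_OS_UNIX_EXPORT_OBJECT union
    _fields_ = [("hDevice", ctypes.c_uint32),
                ("hParent", ctypes.c_uint32),
                ("hObject", ctypes.c_uint32)]

class ExportObject(ctypes.Structure):
    # NV0000_CTRL_OS_UNIX_EXPORT_OBJECT: a type tag plus the RM object handles
    _fields_ = [("type", ctypes.c_uint32),
                ("data", ExportObjectRmObject)]

class ExportObjectToFdParams(ctypes.Structure):
    # NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS: the object to export,
    # the destination file descriptor, and flags
    _fields_ = [("object", ExportObject),
                ("fd", ctypes.c_int32),
                ("flags", ctypes.c_uint32)]

assert ctypes.sizeof(ExportObjectToFdParams) == 24  # matches paramsSize=24 in the log
```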

Steps to reproduce

```Dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

class MagixNet(L.LightningModule):
	def __init__(self, nbr_cat):
	    super().__init__()

	    module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
	    module.fc = nn.Linear(2048, nbr_cat)

	    self.module = module


	def forward(self, x):
	    return self.module(x)

	def training_step(self, batch, batch_idx):
	    x, y = batch
	    y_hat = self(x)
	    loss = F.cross_entropy(y_hat, y)
	    return loss

	def configure_optimizers(self):
	    return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

start  = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF

ENTRYPOINT ["python3", "repro.py"]
```
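
Assuming the Dockerfile above is saved as `Dockerfile`, the image can be built and run roughly like this (the tag is arbitrary, `--runtime=runsc` assumes runsc is registered under that name, and `--gpus` should expose more than one GPU to exercise the multi-GPU path):

```
docker build -t issue10413 .
docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413
```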

runsc version

```
runsc version 6e61813c1b37
spec: 1.1.0-rc.1
```

docker version (if using docker)

N/A

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

@thundergolfer thundergolfer added the type: bug label May 9, 2024
@thundergolfer
Contributor Author

The reproduction program is almost identical to the one in #9827, which is why I revisited that issue's test.

@ayushr2
Collaborator

ayushr2 commented May 9, 2024

This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:

```

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]

-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")

ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May  9 15:27:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              49W / 400W |      4MiB / 40960MiB |     27%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

```

Please note:

  • The run above ends with a ValueError because the final print in repro.py uses the format specifier `:2.f` rather than `:.2f`.
  • I ran with `--shm-size=128g`.

So maybe you are using a different driver version? Or maybe it is something to do with the Oracle Cloud environment?
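
For what it's worth, a quick way to print the driver version from inside the container is NVML via the nvidia-ml-py bindings (a sketch; pynvml is an extra install and not part of the repro image, and `nvidia-smi` reports the same information):

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
# Older pynvml releases return bytes rather than str.
if isinstance(version, bytes):
    version = version.decode()
print("NVIDIA driver version:", version)
pynvml.nvmlShutdown()
```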

@thundergolfer
Contributor Author

thundergolfer commented May 9, 2024

  • Oh yep, fixed that in the original description.
  • Our --shm-size is also set very large. On Oracle workers it's around 1657GB.

We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!

On an H100 worker:

```

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:04:00.0 Off |                    0 |
| N/A   36C    P0             113W / 700W |  72459MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             117W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0             114W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0             111W / 700W |  72587MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:84:00.0 Off |                    0 |
| N/A   60C    P0             578W / 700W |  71533MiB / 81559MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0             112W / 700W |    841MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0             114W / 700W |  16463MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0             111W / 700W |   2405MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    759790      C   /opt/conda/bin/python3.10                 72446MiB |

```

We use the same driver version across all GPU workers.

@ayushr2
Collaborator

ayushr2 commented May 9, 2024

I updated the driver version and still cannot reproduce the failure on my GCE VM:

```

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]
Training duration (seconds): 72.35

```

Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully that resolves whatever failure you are seeing.
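
For anyone following along, the reason adding the command helps: nvproxy only forwards control commands it explicitly knows about, and anything else is logged and rejected rather than passed to the host driver. A rough conceptual sketch in Python (the real implementation is Go inside gVisor's sentry, so treat this purely as an illustration):

```python
# Purely illustrative; nvproxy's real dispatch is Go code in gVisor's sentry.
NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD = 0x3d05  # from the NVIDIA header linked above

known_control_commands = {
    # ... control commands nvproxy already knows how to forward ...
    # The fix is, in effect, registering a handler for 0x3d05 here.
}

def rm_control(cmd, params_size):
    handler = known_control_commands.get(cmd)
    if handler is None:
        # This is the path that produced the warnings in this issue.
        print(f"nvproxy: unknown control command {cmd:#x} (paramsSize={params_size})")
        return None  # rejected, not forwarded to the host driver
    return handler(params_size)
```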

@thundergolfer
Contributor Author

> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in #9827 where the test got stuck on runc.

The program doesn't get stuck on runc in Modal; it completes in around 60s. A 72.35-second completion under gVisor lines up with that.

> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully that resolves whatever failure you are seeing.

🙏

copybara-service bot pushed a commit that referenced this issue May 9, 2024
@ayushr2
Collaborator

ayushr2 commented May 9, 2024

@thundergolfer Let me know if e9b3218 fixes the issue. If so, please close this.

@ayushr2 ayushr2 added the area: gpu label May 9, 2024
copybara-service bot pushed a commit that referenced this issue May 13, 2024
This is helpful for handling parameter types that have one field for frontend
FD that needs to be translated (and are simple apart from that). Avoids
repetitive code.

Rename HasRMCtrlFD->HasFrontendFD so it can have a broader meaning.
Implement generic handlers for frontend ioctl and control commands.

Updates #10413.

PiperOrigin-RevId: 633238248
copybara-service bot pushed a commit that referenced this issue May 21, 2024
…EXPORT_OBJECT_INFO, NV0000_CTRL_CMD_OS_UNIX_IMPORT_OBJECT_FROM_FD, NV0041_CTRL_CMD_GET_SURFACE_INFO

Following up on #10413 (comment).

Ayush's fix revealed more missing commands. With these changes, the reproduction in #10413 _still does not work._ Here's an updated reproduction Dockerfile that crashes because of the SIGCHLD handler. Without the SIGCHLD handler the program hangs.

```Dockerfile
FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

import os
import signal
import pathlib

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    os.waitpid(-1, 0)
    raise KeyboardInterrupt()

# gVisor is ignoring the SIGCHILD 'Discarding ignored signal 17'
signal.signal(signal.SIGCHLD, handler)

class MagixNet(L.LightningModule):
	def __init__(self, nbr_cat):
	    super().__init__()

	    module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
	    module.fc = nn.Linear(2048, nbr_cat)

	    self.module = module

	def forward(self, x):
	    return self.module(x)

	def training_step(self, batch, batch_idx):
	    x, y = batch
	    y_hat = self(x)
	    loss = F.cross_entropy(y_hat, y)
	    return loss

	def configure_optimizers(self):
	    return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

if __name__ == "__main__":
    torch.set_float32_matmul_precision('medium')
    train_dl, val_dl = prepare_data()
    model = MagixNet(100)
    trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

    start  = time.time()
    trainer.fit(model, train_dl, val_dl)
    print(f"Training duration (seconds): {time.time() - start:.2f}")
    nccl_debug_file = pathlib.Path("/tmp/runsc-nccl.txt")
    if nccl_debug_file.exists():
        print("NCCL Debugging")
        print(nccl_debug_file.read_text())
EOF

ENTRYPOINT ["python3", "repro.py"]
```

Run like this:

```
sudo docker run --runtime=runsc-2 --shm-size=1000GB --gpus '"device=GPU-48070a35-b2ea-643c-eebe-0c55d2a541a4,GPU-8061048a-aa0f-76bd-457b-71c6be60386e"' -e NCCL_DEBUG=INFO -e NCCL_DEBUG_FILE="/tmp/runsc-nccl.txt" sha256:1c1fc535214ec1111b46a87fe20558e7c078185e4158c3ce253dc56a5a9be628
```

**`/etc/docker/daemon.json`**

```
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc-2": {
            "path": "/home/modal/runsc2",
	    "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc-2/", "-debug", "-strace"]

        }
    }
}
```

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10434 from thundergolfer:master 76bf495
PiperOrigin-RevId: 635812044