
nvproxy: unknown control command 0x3d05 #10413

Open
thundergolfer opened this issue May 9, 2024 · 6 comments
Labels
area: gpu (Issue related to sandboxed GPU access), type: bug (Something isn't working)

Comments

@thundergolfer
Contributor

thundergolfer commented May 9, 2024

Description

We are doing multi-GPU training on A100s and seeing that it gets stuck under gVisor. I tried the program below on the following GPUs within Modal:

  • A100 40 GiB (Oracle Cloud) ❌
  • H100 (a3-highgpu-8g) ❌
  • A10G ✔️
  • T4 ✔️

Both the H100 and A100 run into this unknown control command:

```

W0509 01:16:28.218428  1772489 frontend.go:521] [   6:  20] nvproxy: unknown control command 0x3d05 (paramsSize=24)
W0509 01:16:28.218780  1772489 frontend.go:521] [   5:  22] nvproxy: unknown control command 0x3d05 (paramsSize=24)

```

That command is NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/083cd9cf17ab95cd6f9fb50a5349c21eaa2f7d4b/src/common/sdk/nvidia/inc/ctrl/ctrl0000/ctrl0000unix.h#L146-L147
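
For reference, here is a minimal ctypes sketch of that command's parameter struct, mirroring the linked header (the field names are transcribed from memory of that header, so double-check it there); the point is that its size works out to the paramsSize=24 reported in the warnings above:

```python
import ctypes

NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD = 0x3d05  # value from the linked header

class ExportObjectRmObject(ctypes.Structure):
    # rmObject variant of the NV0000_CTRL_OS_UNIX_EXPORT_OBJECT union
    _fields_ = [("hDevice", ctypes.c_uint32),
                ("hParent", ctypes.c_uint32),
                ("hObject", ctypes.c_uint32)]

class ExportObject(ctypes.Structure):
    # NV0000_CTRL_OS_UNIX_EXPORT_OBJECT: a type tag plus the RM object handles
    _fields_ = [("type", ctypes.c_uint32),
                ("data", ExportObjectRmObject)]

class ExportObjectToFdParams(ctypes.Structure):
    # NV0000_CTRL_OS_UNIX_EXPORT_OBJECT_TO_FD_PARAMS: the object to export,
    # the destination file descriptor, and flags
    _fields_ = [("object", ExportObject),
                ("fd", ctypes.c_int32),
                ("flags", ctypes.c_uint32)]

assert ctypes.sizeof(ExportObjectToFdParams) == 24  # matches paramsSize=24 in the log
```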

Steps to reproduce

```Dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

class MagixNet(L.LightningModule):
	def __init__(self, nbr_cat):
	    super().__init__()

	    module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
	    module.fc = nn.Linear(2048, nbr_cat)

	    self.module = module


	def forward(self, x):
	    return self.module(x)

	def training_step(self, batch, batch_idx):
	    x, y = batch
	    y_hat = self(x)
	    loss = F.cross_entropy(y_hat, y)
	    return loss

	def configure_optimizers(self):
	    return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

torch.set_float32_matmul_precision('medium')
train_dl, val_dl = prepare_data()
model = MagixNet(100)
trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

start  = time.time()
trainer.fit(model, train_dl, val_dl)
print(f"Training duration (seconds): {time.time() - start:.2f}")
EOF

ENTRYPOINT ["python3", "repro.py"]
```
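
Assuming the Dockerfile above is saved as `Dockerfile`, the image can be built and run roughly like this (the tag is arbitrary, `--runtime=runsc` assumes runsc is registered under that name, and `--gpus` should expose more than one GPU to exercise the multi-GPU path):

```
docker build -t issue10413 .
docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413
```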

runsc version

```
runsc version 6e61813c1b37
spec: 1.1.0-rc.1
```

docker version (if using docker)

N/A

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

@thundergolfer thundergolfer added the type: bug label May 9, 2024
@thundergolfer
Contributor Author

The reproduction program is almost identical to the one in #9827, which is why I revisited that issue's test.

@ayushr2
Collaborator

ayushr2 commented May 9, 2024

This seems to be running fine for me on an A100-40GB machine in GCE on driver version 535.104.05:

```

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='15:24:33') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9193099.59it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 156MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]

-------------------------------------------------------------------------------
repro.py 63 <module>
print(f"Training duration (seconds): {time.time() - start:2.f}")

ValueError:
Format specifier missing precision
(base) ayushranjan_google_com@a100:~/issue10413$ nvidia-smi
Thu May  9 15:27:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              49W / 400W |      4MiB / 40960MiB |     27%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

```

Please note:

  • The run above ends with a ValueError because the final print in repro.py uses the format specifier `:2.f` rather than `:.2f`.
  • I ran with `--shm-size=128g`.

So maybe you are using a different driver version? Or maybe it is something to do with the Oracle Cloud environment?
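
For what it's worth, a quick way to print the driver version from inside the container is NVML via the nvidia-ml-py bindings (a sketch; pynvml is an extra install and not part of the repro image, and `nvidia-smi` reports the same information):

```python
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
# Older pynvml releases return bytes rather than str.
if isinstance(version, bytes):
    version = version.decode()
print("NVIDIA driver version:", version)
pynvml.nvmlShutdown()
```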

@thundergolfer
Contributor Author

thundergolfer commented May 9, 2024

  • Oh yep, fixed that in the original description.
  • Our --shm-size is also set very large. On Oracle workers it's around 1657GB.

We have Driver Version: 535.129.03, CUDA Version: 12.2. Sorry, I should have included that in the issue originally!

On an H100 worker:

```

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:04:00.0 Off |                    0 |
| N/A   36C    P0             113W / 700W |  72459MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:05:00.0 Off |                    0 |
| N/A   34C    P0             117W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0             114W / 700W |  72507MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0             111W / 700W |  72587MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:84:00.0 Off |                    0 |
| N/A   60C    P0             578W / 700W |  71533MiB / 81559MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:85:00.0 Off |                    0 |
| N/A   34C    P0             112W / 700W |    841MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0             114W / 700W |  16463MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:8B:00.0 Off |                    0 |
| N/A   34C    P0             111W / 700W |   2405MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    759790      C   /opt/conda/bin/python3.10                 72446MiB |

```

We use the same driver version across all GPU workers.

@ayushr2
Collaborator

ayushr2 commented May 9, 2024

I updated the driver version and still cannot reproduce the failure on my GCE VM:

```

(base) ayushranjan_google_com@a100:~/issue10413$ docker run --runtime=runsc --shm-size=128g --gpus=all --rm issue10413:latest
Hello from inside container.
Processes: current_process=psutil.Process(pid=1, name='python3', status='running', started='16:01:41') parent_process=None
Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to data/cifar-100-python.tar.gz
100%|██████████| 169001437/169001437 [00:18<00:00, 9140159.09it/s] 
Extracting data/cifar-100-python.tar.gz to data
Files already downloaded and verified
Downloading: "https://download.pytorch.org/models/resnet50-11ad3fa6.pth" to /root/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 74.1MB/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.8/dist-packages/lightning/pytorch/trainer/configuration_validator.py:72: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: /lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type   | Params
----------------------------------
0 | module | ResNet | 23.7 M
----------------------------------
23.7 M    Trainable params
0         Non-trainable params
23.7 M    Total params
94.852    Total estimated model params size (MB)
Epoch 0: 100%|██████████| 391/391 [01:08<00:00,  5.68it/s, v_num=0]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 391/391 [01:09<00:00,  5.62it/s, v_num=0]
Training duration (seconds): 72.35

```

Surprisingly, this workload gets stuck without gVisor. I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully that resolves whatever failure you are seeing.
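
For anyone following along, the reason adding the command helps: nvproxy only forwards control commands it explicitly knows about, and anything else is logged and rejected rather than passed to the host driver. A rough conceptual sketch in Python (the real implementation is Go inside gVisor's sentry, so treat this purely as an illustration):

```python
# Purely illustrative; nvproxy's real dispatch is Go code in gVisor's sentry.
NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD = 0x3d05  # from the NVIDIA header linked above

known_control_commands = {
    # ... control commands nvproxy already knows how to forward ...
    # The fix is, in effect, registering a handler for 0x3d05 here.
}

def rm_control(cmd, params_size):
    handler = known_control_commands.get(cmd)
    if handler is None:
        # This is the path that produced the warnings in this issue.
        print(f"nvproxy: unknown control command {cmd:#x} (paramsSize={params_size})")
        return None  # rejected, not forwarded to the host driver
    return handler(params_size)
```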

@thundergolfer
Contributor Author

> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in #9827 where the test got stuck on runc.

The program doesn't get stuck on runc in Modal; it completes in around 60s. A 72.35-second completion under gVisor lines up with that.

> I will add NV0000_CTRL_CMD_OS_UNIX_EXPORT_OBJECT_TO_FD to nvproxy though; hopefully that resolves whatever failure you are seeing.

🙏

copybara-service bot pushed a commit that referenced this issue May 9, 2024
@ayushr2
Collaborator

ayushr2 commented May 9, 2024

@thundergolfer Let me know if e9b3218 fixes the issue. If so, please close this.

@ayushr2 ayushr2 added the area: gpu label May 9, 2024
copybara-service bot pushed a commit that referenced this issue May 13, 2024
This is helpful for handling parameter types that have one field for frontend
FD that needs to be translated (and are simple apart from that). Avoids
repetitive code.

Rename HasRMCtrlFD->HasFrontendFD so it can have a broader meaning.
Implement generic handlers for frontend ioctl and control commands.

Updates #10413.

PiperOrigin-RevId: 633238248
copybara-service bot pushed a commit that referenced this issue May 21, 2024
…EXPORT_OBJECT_INFO, NV0000_CTRL_CMD_OS_UNIX_IMPORT_OBJECT_FROM_FD, NV0041_CTRL_CMD_GET_SURFACE_INFO

Following up on #10413 (comment).

Ayush's fix revealed more missing commands. With these changes, the reproduction in #10413 _still does not work._ Here's an updated reproduction Dockerfile that crashes because of the SIGCHLD handler. Without the SIGCHLD handler the program hangs.

```Dockerfile
FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

import os
import signal
import pathlib

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    os.waitpid(-1, 0)
    raise KeyboardInterrupt()

# gVisor is ignoring the SIGCHILD 'Discarding ignored signal 17'
signal.signal(signal.SIGCHLD, handler)

class MagixNet(L.LightningModule):
	def __init__(self, nbr_cat):
	    super().__init__()

	    module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
	    module.fc = nn.Linear(2048, nbr_cat)

	    self.module = module

	def forward(self, x):
	    return self.module(x)

	def training_step(self, batch, batch_idx):
	    x, y = batch
	    y_hat = self(x)
	    loss = F.cross_entropy(y_hat, y)
	    return loss

	def configure_optimizers(self):
	    return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

if __name__ == "__main__":
    torch.set_float32_matmul_precision('medium')
    train_dl, val_dl = prepare_data()
    model = MagixNet(100)
    trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

    start  = time.time()
    trainer.fit(model, train_dl, val_dl)
    print(f"Training duration (seconds): {time.time() - start:.2f}")
    nccl_debug_file = pathlib.Path("/tmp/runsc-nccl.txt")
    if nccl_debug_file.exists():
        print("NCCL Debugging")
        print(nccl_debug_file.read_text())
EOF

ENTRYPOINT ["python3", "repro.py"]
```

Run like this:

```
sudo docker run --runtime=runsc-2 --shm-size=1000GB --gpus '"device=GPU-48070a35-b2ea-643c-eebe-0c55d2a541a4,GPU-8061048a-aa0f-76bd-457b-71c6be60386e"' -e NCCL_DEBUG=INFO -e NCCL_DEBUG_FILE="/tmp/runsc-nccl.txt" sha256:1c1fc535214ec1111b46a87fe20558e7c078185e4158c3ce253dc56a5a9be628
```

**`/etc/docker/daemon.json`**

```
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc-2": {
            "path": "/home/modal/runsc2",
	    "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc-2/", "-debug", "-strace"]

        }
    }
}
```

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10434 from thundergolfer:master 76bf495
PiperOrigin-RevId: 635812044