
🐛 [Bug] error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr #2827

Open · geraldstanje opened this issue May 10, 2024 · 4 comments
Labels: bug (Something isn't working)

geraldstanje commented May 10, 2024

Bug Description

Hi, I see the following error. It looks like torch.compile worked fine, but when I invoke a prediction afterwards it errors out:

[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[W] Unable to determine GPU memory usage
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [W] Unable to determine GPU memory usage
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1104, GPU 0 (MiB)
[INFO ] W-9001-model_1.0-stdout MODEL_LOG - [05/10/2024-[TRT] [W] CUDA initialization failure with error: 35. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
predict_fn error: backend='torch_tensorrt' raised: TypeError: pybind11::init(): factory function returned nullptr

Does Torch-TensorRT work on a g4dn.xlarge? Why do I get this: CUDA initialization failure with error: 35?

Full log: tensorrt_torch_error.txt

To Reproduce

Steps to reproduce the behavior:

  1. Build a container with TensorRT:
# use sagemaker DLC
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker

# Install additional dependencies
RUN python -m pip install torch torch-tensorrt tensorrt --extra-index-url https://download.pytorch.org/whl/cu118
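
To confirm the installation before wiring it into SageMaker, a quick import check inside the built container can rule out packaging problems (a minimal sketch, not from the original report; it only assumes the packages installed above):

# run inside the built container
import torch
import torch_tensorrt
import tensorrt

print("torch:", torch.__version__)                   # expect 2.1.x+cu118 from this DLC
print("torch_tensorrt:", torch_tensorrt.__version__)
print("tensorrt:", tensorrt.__version__)
print("cuda available:", torch.cuda.is_available())  # needs a GPU instance and a new-enough driver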

  2. Compile the model:

model.model_body[0].auto_model = torch.compile(
    model.model_body[0].auto_model,
    backend="torch_tensorrt",
    dynamic=False,
    options={
        "truncate_long_and_double": True,
        "precision": torch.half,
        "debug": True,
        "min_block_size": 1,
        "optimization_level": 4,
        "use_python_runtime": False,
    },
)

To rule out that the issue is somewhere else, I tested with the following torch.compile call; this works fine:

model.model_body[0].auto_model = torch.compile(model.model_body[0].auto_model, mode="reduce-overhead")

Should I try some other settings for torch.compile(model.model_body[0].auto_model, backend="torch_tensorrt", ...)?
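
One variation that may be worth trying (a sketch, not a confirmed fix: use_python_runtime=True bypasses the C++ runtime whose pybind11 initializer is returning nullptr, and enabled_precisions is the documented spelling of the precision option in recent Torch-TensorRT releases):

model.model_body[0].auto_model = torch.compile(
    model.model_body[0].auto_model,
    backend="torch_tensorrt",
    dynamic=False,
    options={
        "truncate_long_and_double": True,
        "enabled_precisions": {torch.half},  # documented key in recent releases
        "min_block_size": 1,
        "use_python_runtime": True,          # avoid the C++ runtime's pybind11 init
    },
)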

Could the error be related to NVIDIA/TensorRT#308?

Expected behavior

No error.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0):
  • PyTorch Version (e.g. 1.0): 2.1
  • CPU Architecture: g4dn.xlarge
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, libtorch, source):
  • Build command you used (if compiling from source):
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

geraldstanje added the bug label May 10, 2024
@narendasan (Collaborator) commented:

Can you share something like the nvidia-smi printout, so we can see the driver version and status?

@geraldstanje (Author) commented:

@narendasan Sure. In the meantime, where can I check the compatibility of the CUDA driver, PyTorch version, Torch-TensorRT version, etc.?

@narendasan (Collaborator) commented:

For PyTorch vs. Torch-TensorRT compatibility, the versions are aligned, so PyTorch v2.2.0 <-> Torch-TensorRT v2.2.0 (prior to PyTorch 2.0 it would be something like PyTorch 1.13 <-> Torch-TensorRT 1.3.0). Driver compatibility is governed by CUDA: https://docs.nvidia.com/deploy/cuda-compatibility/index.html. If your PyTorch build targets CUDA 11.8, you need a driver >= 450.80.02; if you are using a CUDA 12.1 PyTorch, you need >= 525.60.13. nvidia-smi can help you determine whether your CUDA and CUDA driver are aligned.
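
These checks can also be scripted; a minimal sketch (not from the original thread; note that CUDA error 35 is cudaErrorInsufficientDriver, i.e. the installed driver is older than the CUDA runtime PyTorch was built against):

import subprocess
import torch

# CUDA runtime this PyTorch build targets, e.g. "11.8"
print("torch built for CUDA:", torch.version.cuda)

# installed driver version as reported by nvidia-smi
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("driver:", out.stdout.strip())

# False when the driver is too old for the runtime (CUDA error 35)
print("cuda available:", torch.cuda.is_available())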

@geraldstanje (Author) commented:

@narendasan Looks like the CUDA driver package is cuda-11-4-11.4.0-1. Does that mean I cannot use TensorRT without upgrading the CUDA driver?
