aarch64 build for AWS Linux - Failed to load image Python extension #8305

Open
elkay opened this issue Mar 9, 2024 · 6 comments

Comments

@elkay

elkay commented Mar 9, 2024

🐛 Describe the bug

Built Torch 2.1.2 and TorchVision 0.16.2 from source and running into the following problem:

/home/ec2-user/conda/envs/textgen/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/ec2-user/conda/envs/textgen/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZNK3c1017SymbolicShapeMeta18init_is_contiguousEv'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?

Previously the error was about missing libs rather than an undefined symbol, so I believe the libs are installed correctly now. The build output says:

Compiling extensions with following flags:
   FORCE_CUDA: False
   FORCE_MPS: False
   DEBUG: False
   TORCHVISION_USE_PNG: True
   TORCHVISION_USE_JPEG: True
   TORCHVISION_USE_NVJPEG: True
   TORCHVISION_USE_FFMPEG: True
   TORCHVISION_USE_VIDEO_CODEC: True
   NVCC_FLAGS:
 Compiling with debug mode OFF
 Found PNG library
 Building torchvision with PNG image support
   libpng version: 1.6.37
   libpng include path: /home/ec2-user/conda/envs/textgen/include/libpng16
 Running build on conda-build: False
 Running build on conda: True
 Building torchvision with JPEG image support
   libjpeg include path: /home/ec2-user/conda/envs/textgen/include
   libjpeg lib path: /home/ec2-user/conda/envs/textgen/lib
 Building torchvision without NVJPEG image support
 Building torchvision with ffmpeg support
   ffmpeg version: b'ffmpeg version 4.2.2 Copyright (c) 2000-2019 the FFmpeg developers\nbuilt with gcc 10.2.0 (crosstool-NG 1.22.0.1750_510dbc6_dirty)\nconfiguration: --prefix=/opt/conda/conda-bld/ffmpeg_1622823166193/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh --cc=/opt/conda/conda-bld/ffmpeg_1622823166193/_build_env/bin/aarch64-conda-linux-gnu-cc --disable-doc --enable-avresample --enable-gmp --enable-hardcoded-tables --enable-libfreetype --enable-libvpx --enable-pthreads --enable-libopus --enable-postproc --enable-pic --enable-pthreads --enable-shared --enable-static --enable-version3 --enable-zlib --enable-libmp3lame --disable-nonfree --enable-gpl --enable-gnutls --disable-openssl --enable-libopenh264 --enable-libx264\nlibavutil      56. 31.100 / 56. 31.100\nlibavcodec     58. 54.100 / 58. 54.100\nlibavformat    58. 29.100 / 58. 29.100\nlibavdevice    58.  8.100 / 58.  8.100\nlibavfilter     7. 57.100 /  7. 57.100\nlibavresample   4.  0.  0 /  4.  0.  0\nlibswscale      5.  5.100 /  5.  5.100\nlibswresample   3.  5.100 /  3.  5.100\nlibpostproc    55.  5.100 / 55.  5.100\n'
   ffmpeg include path: ['/home/ec2-user/conda/envs/textgen/include']
   ffmpeg library_dir: ['/home/ec2-user/conda/envs/textgen/lib']
 Building torchvision without video codec support

So I believe I do have things set up correctly to be able to do image calls (I don't care about video). Any idea why I would still be getting the undefined symbol warning? Thanks!
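A note on reading the warning: the undefined symbol is a mangled C++ name, and decoding it shows which library is expected to provide it. The sketch below is a deliberately minimal decoder for simple Itanium-ABI nested names (the helper `demangle_simple` is hypothetical and written for illustration; it ignores templates, substitutions, and most argument types). For the symbol in the warning it yields a member function in the `c10` namespace, which is part of PyTorch's core library rather than libjpeg or libpng. That would be consistent with `image.so` having been compiled against a different torch build than the one loaded at runtime, though the decoder alone cannot prove that.

```python
import re


def demangle_simple(symbol: str) -> str:
    """Decode a simple Itanium-ABI nested name of the form _ZN[K]<len><id>...E<args>.

    Illustrative only: templates, substitutions and non-void argument
    lists are not decoded.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    is_const = False
    # Optional qualifiers on the member function (K = const, V = volatile, r = restrict).
    while symbol[pos] in "KVr":
        is_const = is_const or symbol[pos] == "K"
        pos += 1
    parts = []
    # Each component is a decimal length followed by that many characters.
    while symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"unsupported mangling construct at offset {pos}")
        length = int(match.group())
        start = pos + len(match.group())
        parts.append(symbol[start:start + length])
        pos = start + length
    # After the closing E come the argument types; a lone "v" means no arguments.
    args = "()" if symbol[pos + 1:] == "v" else "(...)"
    return "::".join(parts) + args + (" const" if is_const else "")


print(demangle_simple("_ZNK3c1017SymbolicShapeMeta18init_is_contiguousEv"))
# prints: c10::SymbolicShapeMeta::init_is_contiguous() const
```

In practice a full demangler such as `c++filt` handles every case; the point here is only that the missing symbol belongs to torch itself.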

Versions

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2023.3.20240304 (aarch64)
GCC version: (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.34

Python version: 3.10.9 (main, Mar 8 2023, 10:41:45) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.1.79-99.164.amzn2023.aarch64-aarch64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA T4G
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.2/targets/sbsa-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 256 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2
[pip3] torchvision==0.16.2+cu121
[pip3] triton==2.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.1.2+cu121 pypi_0 pypi
[conda] torchaudio 2.1.2 pypi_0 pypi
[conda] torchvision 0.16.2+cu121 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi

@NicolasHug
Member

> [pip3] torchvision==0.16.2+cu121
> [conda] torchvision 0.16.2+cu121 pypi_0 pypi

Try uninstalling these versions first?

@elkay
Author

elkay commented Mar 12, 2024

> [pip3] torchvision==0.16.2+cu121
> [conda] torchvision 0.16.2+cu121 pypi_0 pypi
>
> Try uninstalling these versions first?

What would that accomplish? That's literally the package that I'm trying to use and that is throwing the error.

@NicolasHug
Member

> Built Torch 2.1.2 and TorchVision 2.1.2 from source

What version of torchvision are you building from source, exactly? There's no torchvision 2.x. The latest stable version is 0.17.

The fact that there already is a stable 0.16.2 version installed while you're trying to build from source is very likely to be causing some issues.
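One way to check which copies are actually in play is to ask Python what distribution is installed and where the module would be imported from. This is a minimal sketch using only the standard library; the helper name `describe_install` is hypothetical, and the output depends entirely on the environment it runs in.

```python
import importlib.util
from importlib import metadata


def describe_install(name: str) -> dict:
    """Report the installed distribution version and import location for a package."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return {"name": name, "version": version, "location": location}


for pkg in ("torch", "torchvision"):
    print(describe_install(pkg))
```

If the reported location or version is not the freshly built one, an older install is probably shadowing it.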

@elkay
Author

elkay commented Mar 12, 2024

> > Built Torch 2.1.2 and TorchVision 2.1.2 from source
>
> What version of torchvision are you building from source, exactly? There's no torchvision 2.x. The latest stable version is 0.17.
>
> The fact that there already is a stable 0.16.2 version installed while you're trying to build from source is very likely to be causing some issues.

Updated original post, torchvision version was a typo.

I did finally get torchvision to build and be functional, but only by forcibly editing the build scripts to pull in my custom build of torch+cuda 2.1.2. The build scripts were importing a non-cuda build because there is no aarch64 torch+cuda out there for pip to pull down. So finally, after forcing my own torch+cuda 2.1.2 whl into the torchvision build, now my torchvision actually works.

I need to say - it's been PAINFUL dealing with building anything that relies on torch because all the build scripts pull down the non-cuda version and mess up the builds. Every time I want to build something relying on torch, now I need to hack in pulling my own torch whl instead for them to work (this also resolved issues I was having building a few other things).

I reaaaaaally hope official aarch64 torch+cuda builds start to be made available so I don't have to keep doing this hackjob.

@NicolasHug
Member

What build script are you referring to? Can you share the build command you used?

@elkay
Author

elkay commented Mar 12, 2024

The box is shut down but I believe it was pyproject.toml that I had to update to point directly at my torch whl and the command I used was "python setup.py bdist_wheel". I had the same outcomes with "pip install -v ." to directly install from source, though.
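For anyone hitting the same thing, a small guard run inside the build environment can confirm that the torch being compiled against is the intended one before the build starts. The sketch below is an assumption-laden illustration, not part of torchvision's build: `torch_build_problems` is a hypothetical helper, and the expected version `"2.1.2"` is taken from this thread. It relies only on `torch.__version__` and `torch.version.cuda`, the latter being `None` for CPU-only builds.

```python
def torch_build_problems(version, cuda, expected_version, require_cuda=True):
    """Return a list of mismatches between the torch found and the torch expected."""
    problems = []
    # Drop a local suffix such as "+cu121" before comparing release numbers.
    base = version.split("+", 1)[0]
    if base != expected_version:
        problems.append(f"torch {version} found, expected {expected_version}")
    if require_cuda and cuda is None:
        problems.append("torch was built without CUDA (torch.version.cuda is None)")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        issues = torch_build_problems(torch.__version__, torch.version.cuda, "2.1.2")
        print(issues or "torch matches the expected build")
```

pip also has a `--no-build-isolation` flag that makes a source build use the packages already installed in the environment instead of fetching build requirements into a temporary one. Whether that alone would have avoided the pyproject.toml edit here was not confirmed in this thread.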
