Linux installer's Forge causes MATLAB to crash #231

Open
villekf opened this issue May 18, 2020 · 14 comments

villekf commented May 18, 2020

The Forge included in the official Linux installer currently causes MATLAB to crash (segfault) when ArrayFire MEX code is run. The example below reproduces the crash.

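#include "arrayfire.h"
#include "mex.h"

// Minimal MEX entry point: query ArrayFire's device info string,
// print it in the MATLAB console, and return it as the first output.
void mexFunction(int nlhs, mxArray *plhs[],
	int nrhs, const mxArray *prhs[])
{
	// The first ArrayFire call is where the crash is observed.
	const char * c = af::infoString();

	mexPrintf("\n%s\n", c);

	plhs[0] = mxCreateString(c);
}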

On Windows, the installer works with MATLAB without issues. Furthermore, building AF from source with Forge ON (or OFF) also works. If the Forge libraries built this way are replaced with the ones from the installer, the crashes start to occur. Removing the Forge libraries altogether stops the crashes.

This occurs both with AF 3.7.1 and 3.6.2 (where the no-gl installer works). GCC/G++ used was 7.5.0 with Ubuntu 18.04.

Member

9prady9 commented May 18, 2020

@villekf Did you check whether the right set of libraries is being used when ArrayFire is called from MATLAB? You can run the program after export AF_TRACE=all to see which libraries are being loaded.

Author

villekf commented May 18, 2020

With export AF_TRACE=all the output is

[platform][1589810797][014006] [ ../src/backend/common/DependencyModule.cpp:55 ] Attempting to load: libforge.so

after which the crash occurs. libforge.so exists only in the lib64 folder of the AF install.

Member

9prady9 commented May 18, 2020

It should have continued without a crash even if it didn't load the Forge library; that is the expected behavior. Can you please share the entire log of the trace output?

Author

villekf commented May 18, 2020

If the Forge files are in the library folder, that is all the output there is (MATLAB segfaults).

If I remove them, the output becomes this on OpenCL (and everything works without crashes):

[platform][1589810519][010979] [ ../src/backend/common/DependencyModule.cpp:55 ] Attempting to load: libforge.so
[platform][1589810519][010979] [ ../src/backend/common/DependencyModule.cpp:60 ] Unable to open forge
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:204 ] Found 1 OpenCL platforms
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:216 ] Found 2 devices on platform NVIDIA CUDA
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:221 ] Found device Tesla P100-PCIE-16GB on platform NVIDIA CUDA
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:221 ] Found device Quadro K620 on platform NVIDIA CUDA
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:226 ] Found 2 OpenCL devices
[platform][1589810519][010979] [ ../src/backend/opencl/device_manager.cpp:320 ] Default device: 0

Member

9prady9 commented May 18, 2020

Which backend are you using?

Based on your output, the program does proceed when it tries to load Forge and the load fails. My guess is that the Forge library is somehow corrupt, because we handle a failed Forge load in all backends.

Can you try reproducing this with a regular C++ example, such as any of the C++ graphics examples shipped with our installer? If it reproduces, the problem is within the code. If not, it must be something related to the environment that the MEX program starts with, I would think.
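
For readers following along, the "handle the scenario when forge loading fails" behavior can be pictured roughly as below. This is a minimal sketch, not ArrayFire's actual loader: the helper name tryLoadOptional is hypothetical, and it only assumes the POSIX dlopen/dlerror API.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (not ArrayFire's code): try to load an optional
// shared library and report the failure instead of aborting.
// Returns the dlopen handle, or nullptr if the library could not be opened.
void* tryLoadOptional(const char* name) {
    void* handle = dlopen(name, RTLD_LAZY);
    if (handle == nullptr) {
        // dlerror() describes the most recent dl* failure, if any.
        const char* why = dlerror();
        std::fprintf(stderr, "Unable to open %s: %s\n", name,
                     why ? why : "unknown error");
    }
    return handle;
}
```

A null check like this only covers the case where dlopen returns. Static initializers of the library run inside dlopen, before it returns, so a failure there is not reported through the return value. On glibc versions older than 2.34, link with -ldl.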

Author

villekf commented May 18, 2020

I'm using OpenCL, but the same thing happens for both CPU and CUDA as well.

Running the conway_opencl example produces the following when using the installer libraries:

[platform][1589814692][008749] [ ../src/backend/common/DependencyModule.cpp:55 ] Attempting to load: libforge.so
[platform][1589814692][008749] [ ../src/backend/common/DependencyModule.cpp:58 ] Found: libforge.so
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:204 ] Found 1 OpenCL platforms
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:216 ] Found 2 devices on platform NVIDIA CUDA
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:221 ] Found device Tesla P100-PCIE-16GB on platform NVIDIA CUDA
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:221 ] Found device Quadro K620 on platform NVIDIA CUDA
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:226 ] Found 2 OpenCL devices
GLX: GLX version 1.3 is requiredError: Could not Create GLFW Window!
[platform][1589814692][008749] [ ../src/backend/opencl/device_manager.cpp:320 ] Default device: 0
ArrayFire v3.7.1 (OpenCL, 64-bit Linux, build d9d9b65)
[0] NVIDIA: Tesla P100-PCIE-16GB, 16280 MB
-1- NVIDIA: Quadro K620, 2000 MB
This example demonstrates the Conway's Game of Life using ArrayFire
There are 4 simple rules of Conways's Game of Life
1. Any live cell with fewer than two live neighbours dies, as if caused by under-population.
2. Any live cell with two or three live neighbours lives on to the next generation.
3. Any live cell with more than three live neighbours dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
Each white block in the visualization represents 1 alive cell, black space represents dead cells
Segmentation fault (core dumped)

CUDA:

[platform][1589830491][014839] [ ../src/backend/common/DependencyModule.cpp:55 ] Attempting to load: libforge.so
[platform][1589830491][014839] [ ../src/backend/common/DependencyModule.cpp:58 ] Found: libforge.so
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:424 ] CUDA Driver supports up to CUDA 10.2 ArrayFire CUDA Runtime 10.0
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:492 ] Found 2 CUDA devices
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:518 ] Found device: Tesla P100-PCIE-16GB (15.9 GB | ~9081.54 GFLOPs | 56 SMs)
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:518 ] Found device: Quadro K620 (1.95 GB | ~823.242 GFLOPs | 3 SMs)
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:553 ] AF_CUDA_DEFAULT_DEVICE: 
[platform][1589830491][014839] [ ../src/backend/cuda/device_manager.cpp:572 ] Default device: 0(Tesla P100-PCIE-16GB)
ArrayFire v3.7.1 (CUDA, 64-bit Linux, build d9d9b65)
Platform: CUDA Runtime 10.0, Driver: 440.64.00
[0] Tesla P100-PCIE-16GB, 16281 MB, CUDA Compute 6.0
-1- Quadro K620, 2001 MB, CUDA Compute 5.0
This example demonstrates the Conway's Game of Life using ArrayFire
There are 4 simple rules of Conways's Game of Life
1. Any live cell with fewer than two live neighbours dies, as if caused by under-population.
2. Any live cell with two or three live neighbours lives on to the next generation.
3. Any live cell with more than three live neighbours dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
Each white block in the visualization represents 1 alive cell, black space represents dead cells
GLX: GLX version 1.3 is requiredError: Could not Create GLFW Window!
ArrayFire Exception (Runtime error :103):
In function void* graphics::ForgeManager::getMainWindow()
In file src/backend/common/graphics_common.cpp:273
OpenGL Error
 0# 0x00007FC82781EBDE in /opt/arrayfire/lib/libafcuda.so.3
 1# 0x00007FC827821FAB in /opt/arrayfire/lib/libafcuda.so.3
 2# af_create_window in /opt/arrayfire/lib/libafcuda.so.3
 3# af::Window::initWindow(int, int, char const*) in /opt/arrayfire/lib/libafcuda.so.3
 4# main in ./conway_cuda
 5# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
 6# _start in ./conway_cuda

In function void af::Window::initWindow(int, int, const char*)
In file src/api/cpp/graphics.cpp:19
terminate called after throwing an instance of 'af::exception'
  what():  ArrayFire Exception (Runtime error :103):
In function void* graphics::ForgeManager::getMainWindow()
In file src/backend/common/graphics_common.cpp:273
OpenGL Error
 0# 0x00007FC82781EBDE in /opt/arrayfire/lib/libafcuda.so.3
 1# 0x00007FC827821FAB in /opt/arrayfire/lib/libafcuda.so.3
 2# af_create_window in /opt/arrayfire/lib/libafcuda.so.3
 3# af::Window::initWindow(int, int, char const*) in /opt/arrayfire/lib/libafcuda.so.3
 4# main in ./conway_cuda
 5# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
 6# _start in ./conway_cuda

In function void af::Window::initWindow(int, int, const char*)
In file src/api/cpp/graphics.cpp:19
Aborted (core dumped)

On the other hand, building from source (no crashes in MATLAB, this time the current master) produces the following with conway_opencl:

[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/common/DependencyModule.cpp:55 ] Attempting to load: libforge.so
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/common/DependencyModule.cpp:58 ] Found: libforge.so
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:209 ] Found 1 OpenCL platforms
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:221 ] Found 2 devices on platform NVIDIA CUDA
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:226 ] Found device Tesla P100-PCIE-16GB on platform NVIDIA CUDA
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:226 ] Found device Quadro K620 on platform NVIDIA CUDA
[platform][1589816971][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:231 ] Found 2 OpenCL devices
GLX: GLX version 1.3 is requiredError: Could not Create GLFW Window!
[platform][1589816972][027770] [ /home/Downloads/arrayfire/src/backend/opencl/device_manager.cpp:327 ] Default device: 0
ArrayFire v3.8.0 (OpenCL, 64-bit Linux, build 3ad4c0da)
[0] NVIDIA: Tesla P100-PCIE-16GB, 16280 MB
-1- NVIDIA: Quadro K620, 2000 MB
This example demonstrates the Conway's Game of Life using ArrayFire
There are 4 simple rules of Conways's Game of Life
1. Any live cell with fewer than two live neighbours dies, as if caused by under-population.
2. Any live cell with two or three live neighbours lives on to the next generation.
3. Any live cell with more than three live neighbours dies, as if by overcrowding.
4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
Each white block in the visualization represents 1 alive cell, black space represents dead cells
Segmentation fault (core dumped)

Non-graphics related examples work fine in both cases, e.g. helloworld completes fine (though the same GLX error shows up in OpenCL versions). The crashes seem to occur only in MATLAB as well as in GNU Octave (both mex- and oct-files).

Member

9prady9 commented May 19, 2020

@villekf Is it possible for you to reach out to me on https://join.slack.com/t/arrayfire-org/shared_invite/MjI4MjIzMDMzMTczLTE1MDI5ODg4NzYtN2QwNGE3ODA5OQ so we can continue the discussion there? I have some questions, and a quick back-and-forth on Slack may resolve the problem faster.

Member

9prady9 commented Jun 1, 2020

Self-reminder to @9prady9: need to try this setup with Octave.

@9prady9 9prady9 self-assigned this Jul 20, 2020
Member

9prady9 commented Jul 20, 2020

@villekf I was able to reproduce this problem. As you have experienced, the problem seems to occur only when forge.so from the installer binary is used. I will update here as soon as I find a fix. Thank you for reporting it!

Note that you can work around the issue by running export AF_DISABLE_GRAPHICS=1 before launching the Octave session. This workaround worked with version 3.6.4, but with 3.7.2 (the latest release) the code seems to fail at the moment Forge is loaded, regardless of the value of AF_DISABLE_GRAPHICS.
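
Since the workaround depends on the variable being exported before the session starts, it can help to confirm what the host process actually sees. The helper below is a small sketch for that purpose only. The name envOrUnset is hypothetical, and it makes no claim about how ArrayFire itself interprets these variables.

```cpp
#include <cstdlib>
#include <string>

// Returns the value of an environment variable as seen by the current
// process, or "<unset>" when the variable is not defined.
std::string envOrUnset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("<unset>");
}
```

Inside a MEX or oct-file, printing envOrUnset("AF_DISABLE_GRAPHICS") and envOrUnset("AF_TRACE") shows whether the exports reached the MATLAB or Octave process.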

Member

9prady9 commented May 6, 2021

Wow, finally found the problem.

There is a global constant of type std::regex in the Forge library that fails to initialize for some reason.

The root cause is the following: during the creation of this regex constant, an allocation request is made with a huge size (it looks like a garbage value). Here is the stack trace:

#0  0x00007ffff5727012 in __cxxabiv1::__cxa_throw(void*, std::type_info*, void (*)(void*))
    (obj=0x7fffdc6b5eb0, tinfo=0x7ffff5851ee0 <typeinfo for std::bad_alloc>, dest=0x7ffff57252e0 <std::bad_alloc::~bad_alloc()>) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:78
#1  0x00007ffff571a438 in operator new(unsigned long) (sz=281474137743328) at /build/gcc/src/gcc/libstdc++-v3/libsupc++/new_op.cc:54
#2  0x00007ffff576b9aa in __gnu_cxx::new_allocator<char>::allocate(unsigned long, void const*) (__n=<optimized out>, this=<optimized out>)
    at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/ext/new_allocator.h:115
#3  std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (__capacity=281474137743303,
    __capacity@entry=140737312585472, __old_capacity=<optimized out>, __alloc=...) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:1060
#4  0x00007ffff576bb40 in std::string::_M_mutate(unsigned long, unsigned long, unsigned long) (this=0x7fffe6ff2e88, __pos=__pos@entry=0, __len1=__len1@entry=0, __len2=__len2@entry=0)
    at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:931
#5  0x00007ffff576bd5c in std::string::_M_leak_hard() (this=0x7fffe6ff2e88) at /build/gcc/src/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:914
#6  0x00007fffe492816a in std::__detail::_Compiler<std::regex_traits<char> >::_M_atom() () at /home/pradeep/arrayfire/lib64/libforge.so
#7  0x00007fffe4928ae8 in std::__detail::_Compiler<std::regex_traits<char> >::_M_alternative() () at /home/pradeep/arrayfire/lib64/libforge.so
#8  0x00007fffe4928cf9 in std::__detail::_Compiler<std::regex_traits<char> >::_M_disjunction() () at /home/pradeep/arrayfire/lib64/libforge.so
#9  0x00007fffe492942b in std::__detail::_Compiler<std::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) ()
    at /home/pradeep/arrayfire/lib64/libforge.so
#10 0x00007fffe4929907 in std::enable_if<std::__detail::__is_contiguous_normal_iter<char const*>::value, std::shared_ptr<std::__detail::_NFA<std::regex_traits<char> > const> >::type std::__detail::__compile_nfa<char const*, std::regex_traits<char> >(char const*, char const*, std::regex_traits<char>::locale_type const&, std::regex_constants::syntax_option_type) ()
    at /home/pradeep/arrayfire/lib64/libforge.so
#11 0x00007fffe490daff in _GLOBAL__sub_I_chart_impl.cpp () at /home/pradeep/arrayfire/lib64/libforge.so
#12 0x00007ffff7fdbe8e in call_init.part () at /lib64/ld-linux-x86-64.so.2
#13 0x00007ffff7fdbf78 in _dl_init () at /lib64/ld-linux-x86-64.so.2
#14 0x00007ffff55d6b95 in _dl_catch_exception () at /usr/lib/libc.so.6
#15 0x00007ffff7fe030a in dl_open_worker () at /lib64/ld-linux-x86-64.so.2
#16 0x00007ffff55d6b38 in _dl_catch_exception () at /usr/lib/libc.so.6
#17 0x00007ffff7fdfade in _dl_open () at /lib64/ld-linux-x86-64.so.2
#18 0x00007ffff1fa934c in  () at /usr/lib/libdl.so.2
#19 0x00007ffff55d6b38 in _dl_catch_exception () at /usr/lib/libc.so.6
#20 0x00007ffff55d6c03 in _dl_catch_error () at /usr/lib/libc.so.6
#21 0x00007ffff1fa9b89 in  () at /usr/lib/libdl.so.2
#22 0x00007ffff1fa93d8 in dlopen () at /usr/lib/libdl.so.2
#23 0x00007fff757e345a in common::loadLibrary(char const*) (library_name=<optimized out>) at ../src/backend/common/module_loading_unix.cpp:25

I am exploring some leads on this to figure out the best way to resolve it.
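
As a sanity check on the "garbage value" remark, the two sizes in the trace can be compared against the address space a process can use. The sketch below copies the numbers from the operator new and _S_create frames. The 47-bit user address space limit is an assumption (the usual figure for x86-64 Linux with 4-level paging), and the helper name is hypothetical.

```cpp
#include <cstdint>

// Values copied from the stack trace above.
constexpr std::uint64_t kRequestedBytes  = 281474137743328ULL;  // sz in operator new
constexpr std::uint64_t kCapacityAtEntry = 140737312585472ULL;  // __capacity@entry in _S_create

// Assumption: 47-bit user address space (x86-64 Linux, 4-level paging).
constexpr std::uint64_t kUserAddressSpace = 1ULL << 47;

// True when a request is larger than the whole user address space,
// which means the allocation cannot succeed.
constexpr bool exceedsUserAddressSpace(std::uint64_t bytes) {
    return bytes > kUserAddressSpace;
}

// The requested size is just under 2^48 bytes (256 TiB).
static_assert(exceedsUserAddressSpace(kRequestedBytes),
              "request is larger than the user address space");

// Written in hex, the capacity at entry has the shape of a user-space address.
static_assert(kCapacityAtEntry == 0x7FFFF585F700ULL,
              "capacity at entry equals 0x7ffff585f700");
```

The hex form 0x7ffff585f700 falls in the same range as other addresses in the trace (for example tinfo=0x7ffff5851ee0). That fits the idea that a pointer-like value was read as a string capacity, though it is only an observation and not a diagnosis.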

Member

9prady9 commented May 6, 2021

Note that none of this is an issue when the same set of shared libraries from the installer is used directly, i.e. not from an Octave MEX function.
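
For context on why the failure happens at load time: a namespace-scope std::regex constant is constructed by the library's static initializers, which the loader runs during dlopen (the _GLOBAL__sub_I_chart_impl.cpp frame under _dl_init in the trace). The sketch below contrasts that with construction on first use. It is illustrative only: the names are hypothetical, it is not Forge's code, and it does not address whatever makes the regex construction itself fail.

```cpp
#include <regex>
#include <string>

// Pattern A (shown for contrast only): a namespace-scope constant.
// Its constructor runs when the containing binary is loaded; for a
// shared library that means during dlopen.
// const std::regex kNumberPattern("[0-9]+");

// Pattern B: construct on first use. The regex is built the first time
// this function is called, not at load time.
const std::regex& numberPattern() {
    static const std::regex pattern("[0-9]+");
    return pattern;
}

// Hypothetical helper, for illustration: true if text is one or more digits.
bool isAllDigits(const std::string& text) {
    return std::regex_match(text, numberPattern());
}
```

With pattern B, an exception thrown while building the regex would surface at the first call, where it can be caught, rather than inside the dynamic loader.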

Member

9prady9 commented May 8, 2021

@9prady9 9prady9 transferred this issue from arrayfire/arrayfire May 8, 2021
@9prady9 9prady9 added the bug label May 8, 2021
@9prady9 9prady9 added this to the 1.0.8 milestone May 8, 2021
Member

9prady9 commented May 11, 2021

Finally found the root cause of this: https://bugzilla.redhat.com/show_bug.cgi?id=1546704

Note: the Forge .so file in the ArrayFire installer is built on CentOS using devtoolset-7.

Author

villekf commented Jun 9, 2022

Any update on this?

@9prady9 9prady9 modified the milestones: 1.0.8, 1.0.9 Sep 11, 2022