
torch.Library can easily cause segfault on loading/unloading #125234

Open
ezyang opened this issue Apr 30, 2024 · 2 comments
Labels
module: library (Related to torch.library, for registering ops from Python); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Apr 30, 2024

🐛 Describe the bug

When a segfault occurs, it looks something like this:

#0  std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffff7c18, __s=0x2 <error: Cannot access memory at address 0x2>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
#1  0x00007fffedd22d25 in std::__ostream_write<char, std::char_traits<char> > (__out=..., __s=<optimized out>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:321
#2  0x00007fffedd22df0 in std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x2 <error: Cannot access memory at address 0x2>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:180
#3  0x00007fffe23adcc9 in c10::operator<<(std::ostream&, c10::OperatorName const&) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_cpu.so
#4  0x00007fffed0c0c0f in std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<c10::FunctionSchema const&> (*)(c10::detail::overloaded_t<torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::JitOnlyOperator const&)#2}>&&, std::variant<torch::jit::Operator::C10Operator, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}> const&)>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke(torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::JitOnlyOperator const&)#2}, std::variant<torch::jit::Operator::C10Operator, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}>) ()
   from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#5  0x00007fffed1943f9 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) ()
   from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#6  0x00007fffed07a5ce in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args, pybind11::kwargs)#1}, pybind11::object, {lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}, pybind11::args, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args, pybind11::kwargs)#1}&&, pybind11::object (*)({lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}, pybind11::args), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#7  0x00007fffecc5a282 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#8  0x00000000004fc697 in cfunction_call (func=0x7fff37fd3060, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.10.13/Objects/methodobject.c:543

Unfortunately, I was unable to make a self-contained reproducer. There seems to be something funny that pytest is doing with object lifetimes that causes the problem. To reproduce:

  1. Patch in Add propagate_real_tensors mode for unbacked #125115
  2. Replace the test_lib._destroy() call with pass
  3. Run two tests back-to-back, in order. Merging them into a single test does not reproduce the problem (there is some interaction with pytest). This recipe works: pip install detect-test-pollution, then create a testids.txt file with contents:
test/test_fake_tensor.py::FakeTensorTest::test_custom_op_fallback
test/test_fake_tensor.py::PropagateRealTensorsFakeTensorTest::test_custom_op_fallback_propagate_real_tensors

and then run pytest test/test_fake_tensor.py -p detect_test_pollution --dtp-testids-input-file testids.txt

As step 2 above suggests, explicitly destroying the library object fixes the problem. This is reminiscent of typical pytest pathology, where pytest keeps objects alive longer than it should. But it is not just a matter of keeping things alive: if you intentionally leak the test_lib object (e.g., by assigning it to a global, as sketched further below), you get the error you would hope for:

Traceback (most recent call last):
  File "/data/users/ezyang/b/pytorch/test/test_fake_tensor.py", line 92, in test_custom_op_fallback
    test_lib = Library("my_test_op", "DEF")  # noqa: TOR901
  File "/data/users/ezyang/b/pytorch/torch/library.py", line 70, in __init__
    self.m: Optional[Any] = torch._C._dispatch_library(kind, ns, dispatch_key, filename, lineno)
RuntimeError: Only a single TORCH_LIBRARY can be used to register the namespace my_test_op; please put all of your definitions in a single TORCH_LIBRARY block.  If you were trying to specify implementations, consider using TORCH_LIBRARY_IMPL (which can be duplicated).  If you really intended to define operators for a single namespace in a distributed way, you can use TORCH_LIBRARY_FRAGMENT to explicitly indicate this.  Previous registration of TORCH_LIBRARY was registered at /dev/null:2757; latest registration was registered at /dev/null:309

Very puzzling!
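For reference, "intentionally leak the test_lib object" means something like the following sketch (the global name and the operator schema are made up for illustration):

# Hypothetical sketch of the "leak to a global" experiment: keeping the Library
# object reachable from module scope so it is never collected, which yields the
# duplicate-namespace RuntimeError above instead of a segfault.
from torch.library import Library

_KEEP_ALIVE = []  # module-level global; name invented for this sketch

def test_custom_op_fallback(self):
    test_lib = Library("my_test_op", "DEF")  # noqa: TOR901
    test_lib.define("foo(Tensor self) -> Tensor")
    _KEEP_ALIVE.append(test_lib)  # intentional leak: outlives the test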

I'm not going to investigate the problem any further, but I will note that explicit deallocation solves it, so we should probably prophylactically update all torch.Library uses in the test suite to go through a scoped handler that takes care of the deallocation.
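One possible shape for such a scoped handler, sketched here as a plain contextmanager around Library and its _destroy() method (a sketch, not the helper that ships with PyTorch):

# Sketch of a scoped handler: construct the Library on entry and always call
# _destroy() on exit, so pytest cannot keep the registration alive past the test.
import contextlib
from torch.library import Library

@contextlib.contextmanager
def scoped_library(ns, kind, dispatch_key=""):
    lib = Library(ns, kind, dispatch_key)  # noqa: TOR901
    try:
        yield lib
    finally:
        lib._destroy()  # explicit deallocation regardless of test outcome

Tests would then write with scoped_library("my_test_op", "DEF") as test_lib: instead of constructing Library directly.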

Versions

main

cc @anjali411

@ezyang
Contributor Author

ezyang commented Apr 30, 2024

cc @zou3519

@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: library (Related to torch.library, for registering ops from Python) labels on Apr 30, 2024
@zou3519
Contributor

zou3519 commented May 21, 2024

probably prophylactically update all torch.Library uses in the test suite to go through a scoped handler that takes care of the deallocation.

There is a scoped handler, torch.library._scoped_library. @kit1980 has a lint rule for this and hopefully we'll be able to codemod old usages away.
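For anyone landing here, usage looks roughly like this (the operator name and schema below are made up for illustration):

# Using the scoped helper so the registration is torn down when the block exits.
from torch.library import _scoped_library

def test_custom_op_fallback(self):
    with _scoped_library("my_test_op", "DEF") as test_lib:
        test_lib.define("foo(Tensor self) -> Tensor")
        # ... exercise the op; no manual _destroy() needed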
