
torch.Library can easily cause segfault on loading/unloading #125234

Open
ezyang opened this issue Apr 30, 2024 · 2 comments
Labels
module: library (Related to torch.library, for registering ops from Python); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ezyang
Contributor

ezyang commented Apr 30, 2024

🐛 Describe the bug

When a segfault occurs, it looks something like this:

#0  std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffff7c18, __s=0x2 <error: Cannot access memory at address 0x2>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
#1  0x00007fffedd22d25 in std::__ostream_write<char, std::char_traits<char> > (__out=..., __s=<optimized out>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:321
#2  0x00007fffedd22df0 in std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x2 <error: Cannot access memory at address 0x2>, __n=7484960)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1685813977163/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:180
#3  0x00007fffe23adcc9 in c10::operator<<(std::ostream&, c10::OperatorName const&) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_cpu.so
#4  0x00007fffed0c0c0f in std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<c10::FunctionSchema const&> (*)(c10::detail::overloaded_t<torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::JitOnlyOperator const&)#2}>&&, std::variant<torch::jit::Operator::C10Operator, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}> const&)>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke(torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::JitOnlyOperator const&)#2}, std::variant<torch::jit::Operator::C10Operator, torch::jit::Operator::schema() const::{lambda(torch::jit::Operator::C10Operator const&)#1}>) ()
   from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#5  0x00007fffed1943f9 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optional<c10::DispatchKey>) ()
   from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#6  0x00007fffed07a5ce in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args, pybind11::kwargs)#1}, pybind11::object, {lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}, pybind11::args, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const::{lambda(pybind11::args, pybind11::kwargs)#1}&&, pybind11::object (*)({lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)#210}, pybind11::args), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#7  0x00007fffecc5a282 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /data/users/ezyang/b/pytorch/torch/lib/libtorch_python.so
#8  0x00000000004fc697 in cfunction_call (func=0x7fff37fd3060, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.10.13/Objects/methodobject.c:543

Unfortunately, I was unable to make a self-contained reproducer. There seems to be something funny that pytest is doing with object lifetimes that causes the problem. To reproduce:

  1. Patch in Add propagate_real_tensors mode for unbacked #125115
  2. Replace the test_lib._destroy() call with pass
  3. Run two tests back-to-back, in order. Merging them into a single test does not reproduce the problem (there is some interaction with pytest). This recipe works: pip install detect-test-pollution, then create a testids.txt file with contents:
test/test_fake_tensor.py::FakeTensorTest::test_custom_op_fallback
test/test_fake_tensor.py::PropagateRealTensorsFakeTensorTest::test_custom_op_fallback_propagate_real_tensors

and then run pytest test/test_fake_tensor.py -p detect_test_pollution --dtp-testids-input-file testids.txt

As step 2 above suggests, explicitly destroying the library object fixes the problem. This is reminiscent of typical pytest pathology, where pytest keeps objects alive longer than it should. But it is not just a matter of keeping things alive: if you intentionally leak the test_lib object (e.g., by assigning it to a global, as sketched further below), you get the error you would hope for:

Traceback (most recent call last):
  File "/data/users/ezyang/b/pytorch/test/test_fake_tensor.py", line 92, in test_custom_op_fallback
    test_lib = Library("my_test_op", "DEF")  # noqa: TOR901
  File "/data/users/ezyang/b/pytorch/torch/library.py", line 70, in __init__
    self.m: Optional[Any] = torch._C._dispatch_library(kind, ns, dispatch_key, filename, lineno)
RuntimeError: Only a single TORCH_LIBRARY can be used to register the namespace my_test_op; please put all of your definitions in a single TORCH_LIBRARY block.  If you were trying to specify implementations, consider using TORCH_LIBRARY_IMPL (which can be duplicated).  If you really intended to define operators for a single namespace in a distributed way, you can use TORCH_LIBRARY_FRAGMENT to explicitly indicate this.  Previous registration of TORCH_LIBRARY was registered at /dev/null:2757; latest registration was registered at /dev/null:309

Very puzzling!
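For reference, "intentionally leak the test_lib object" means something like the following sketch (the global name and the operator schema are made up for illustration):

# Hypothetical sketch of the "leak to a global" experiment: keeping the Library
# object reachable from module scope so it is never collected, which yields the
# duplicate-namespace RuntimeError above instead of a segfault.
from torch.library import Library

_KEEP_ALIVE = []  # module-level global; name invented for this sketch

def test_custom_op_fallback(self):
    test_lib = Library("my_test_op", "DEF")  # noqa: TOR901
    test_lib.define("foo(Tensor self) -> Tensor")
    _KEEP_ALIVE.append(test_lib)  # intentional leak: outlives the test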

I'm not going to investigate the problem any further, but I will note that explicit deallocation solves it, so we should probably prophylactically update all torch.Library uses in the test suite to go through a scoped handler that takes care of the deallocation.
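One possible shape for such a scoped handler, sketched here as a plain contextmanager around Library and its _destroy() method (a sketch, not the helper that ships with PyTorch):

# Sketch of a scoped handler: construct the Library on entry and always call
# _destroy() on exit, so pytest cannot keep the registration alive past the test.
import contextlib
from torch.library import Library

@contextlib.contextmanager
def scoped_library(ns, kind, dispatch_key=""):
    lib = Library(ns, kind, dispatch_key)  # noqa: TOR901
    try:
        yield lib
    finally:
        lib._destroy()  # explicit deallocation regardless of test outcome

Tests would then write with scoped_library("my_test_op", "DEF") as test_lib: instead of constructing Library directly.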

Versions

main

cc @anjali411

@ezyang
Contributor Author

ezyang commented Apr 30, 2024

cc @zou3519

@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: library (Related to torch.library, for registering ops from Python) labels on Apr 30, 2024
@zou3519
Contributor

zou3519 commented May 21, 2024

probably prophylactically update all torch.Library uses in the test suite to go through a scoped handler that takes care of the deallocation.

There is a scoped handler, torch.library._scoped_library. @kit1980 has a lint rule for this and hopefully we'll be able to codemod old usages away.
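For anyone landing here, usage looks roughly like this (the operator name and schema below are made up for illustration):

# Using the scoped helper so the registration is torn down when the block exits.
from torch.library import _scoped_library

def test_custom_op_fallback(self):
    with _scoped_library("my_test_op", "DEF") as test_lib:
        test_lib.define("foo(Tensor self) -> Tensor")
        # ... exercise the op; no manual _destroy() needed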
