Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python operations on shape nodes crash #184

Open
peabody124 opened this issue Jun 20, 2023 · 9 comments
Open

Python operations on shape nodes crash #184

peabody124 opened this issue Jun 20, 2023 · 9 comments

Comments

@peabody124
Copy link
Contributor

I've tracked down a regression in the python bindings that causes a segafult when nimblephysics when from 0.8.77 to 0.8.78

This can be easily reproduced with:

import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()
shapeNode = nodes[0].getShapeNode(0)
shapeNode.getName()

it also occurs with getShape()

This can be reproduced in the devcontainer, too.

I'm pretty sure this change https://github.com/keenon/nimblephysics/pull/105/files#diff-a5184a1951f89b1f86b8b45fd5157f4bdb0be20ea8c62e8fefa74cc2c28371dc or the similar one in ShapeFrame is related, but I'm not certain how

@keenon
Copy link
Owner

keenon commented Jun 20, 2023

My rough guess here is that in this example, the actual C++ structure for skeleton goes out of scope right after you stop referencing nodes, and then gets freed. Then shapeNode, which since the refactor you pointed to is no longer a shared_ptr, but just a regular plain-vanilla pointer, is a pointer to freed memory (if it were a shared_ptr, it would have prevented the skeleton from getting freed). Then calling shapeNode.getName() is likely to crash (but probably won't always crash on every computer and OS, depending on how their underlying memory management works).

If my hunch about root cause is right, then this may be a tricky bug to solve, because it would probably require bringing back shared_ptr in the definition of the ShapeNode in our pybind11 bindings (though there's hope that perhaps we could use pybind11's reference_internal, maybe?). The trouble with that is that (because of pybind11 limitations that I don't remember the exact details of) doing that breaks our ability to generate *.pyi stubs, which breaks autocomplete in PyCharm.

@peabody124
Copy link
Contributor Author

@keenon I don't think it's as simple as that. I found this in code where it keeps the skeleton around much longer. I simply can never access getName or getShape without a segfault.

Thanks for the suggestion, I'm going to look into trying reference_internal.

@keenon
Copy link
Owner

keenon commented Jun 21, 2023

Good luck! I'd highly recommend reading the docs on return value policies in pybind11 (https://pybind11.readthedocs.io/en/stable/advanced/functions.html#return-value-policies), and then looking at the corresponding Python bindings (for example, the binding for Skeleton here: https://github.com/keenon/nimblephysics/blob/master/python/_nimblephysics/dynamics/Skeleton.cpp)

@peabody124
Copy link
Contributor Author

Thanks, looks like a good reference.

I'm finding quite an odd pattern of crashes using the shapeNodes. I also found even calling getBodyNodes() and letting the code end would crash (perhaps that has been changed on a more recent block of code, but I'm working near where the regression occurrred). Hopefully this can also help me figure out how to avoid needing to manually delete the fitMarkers to avoid segfaults from that, too.

import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()
# just calling this and letting the program terminate causes a crash if
# make getBodyNodes use reference_internal

shapeNode = nodes[0].getShapeNode(1)

print(type(shapeNode))
print(dir(shapeNode))
print(shapeNode.getShapeNodeProperties())  # on ShapeNode, no crash
print(shapeNode.getRelativeRotation())     # on ShapeNode, no crash

print(shapeNode.getShape())             # method on ShapeFrame, crashes
print(shapeNode.isShapeNode())           # method on ShapeFrame, no crash

#print(shapeNode.getRelativeTransform()) # method on Frame, causes a crash
print(shapeNode.getLinearVelocity())     # method on Frame, no crash
print(shapeNode.getTransform())          # method on Frame, no crash

#print(shapeNode.getParentFrame())       # method on Entity, crash
#print(shapeNode.getName())              # method on Entity, crash
print(shapeNode.isFrame())               # method on Entity, no crash
#shapeNode.setName("Hello")              # method on Entity, crash

@peabody124
Copy link
Contributor Author

peabody124 commented Jun 21, 2023

@keenon so I spent a while trying to see if anything about the return type around the methods would prevent the crash and had no success.

I have two possible solutions that fix the crash. The first is reverting the move away from shared_ptr:
master...peabody124:nimblephysics:revert_shared_ptr

This seems the most correct to me. Dart still has all of these classes defined as shared_ptr. I can't really follow the logic of why this behavior was changed, so perhaps I am missing some context. On this branch, all of the calls above go through without crashing.

Another solution that is less elegant but more compact is here master...peabody124:nimblephysics:get_shape_crash_fix which basically exposes a new method that gets the shape directly from the body node and doesn't go through python to trigger bad garbage collection.

Thoughts? Happy to submit a PR.

p.s., any thoughts about adding python unit tests?

@keenon
Copy link
Owner

keenon commented Jun 21, 2023

Here's a fix that works on my local machine, without rollback: #186

It turned out there are two issues. One is an incorrect handling of ownership of getBodyNodes() in the Python bindings, but another was an issue with legacy multiple-inheritance from DART. (Personally, all the multiple-inheritance in the code base is something I would never have voluntarily done). The issue here is that if we create the Pybind11 bindings with the multiple-inheritance explicit in them, then Pybind11 will generate class names that include C++ syntax. That C++ syntax then breaks IDEs (like PyCharm) that are attempting to parse the method signatures on Nimble's native library, and so then you lose autocomplete for users of the Python library (which is really unacceptable). But if you specify a simplified (and not entirely accurate) inheritance tree to Pybind11 where each class only inherits from one parent, when you attempt to call the parent's methods without an explicit re-binding in Pybind11's definition of the child class, you crash on a segfault. ShapeNode is one such example. I don't want to revert the refactor you pointed to, because that's what fixed autocomplete for users of the Python library, and I don't want to refactor the entire C++ codebase to get rid of multiple inheritance (because that would touch hundreds of files and hundreds of tests and I would never really trust I hadn't introduced a bunch of bugs in the refactor), and so the "solution" (sad as it is) is to add explicit bindings where they're missing.

@peabody124
Copy link
Contributor Author

Thanks. That fixes the main issues I was concerned about. There are still two crashes in my test cases using 0.9.10 (that weren't there on the revert branch). Both of these commented lines below do it:

import time
import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()

print(nodes[0].getTransform())         # method on BodyNode, no crash
print(nodes[0].getName())              # method on BodyNode, no crash

shapeNode = nodes[0].getShapeNode(1)

print(type(shapeNode))
print(dir(shapeNode))
print(shapeNode.getShapeNodeProperties())  # on ShapeNode, no crash
print(shapeNode.getRelativeRotation())     # on ShapeNode, no crash

print(shapeNode.getShape())             # method on ShapeFrame, no crash
print(shapeNode.isShapeNode())           # method on ShapeFrame, no crash


#print(shapeNode.getRelativeTransform())  # method on Frame, causes a crash
print(shapeNode.getLinearVelocity())     # method on Frame, no crash
print(shapeNode.getTransform())          # method on Frame, no crash

#print(shapeNode.getParentFrame())       # method on Entity, crash
print(shapeNode.getName())              # method on Entity, no crash
print(shapeNode.isFrame())               # method on Entity, no crash

@peabody124
Copy link
Contributor Author

Also while bringing these things up I've also found this code segfaults without the del fitMarkers

        fitMarkers = results[0].updatedMarkerMap
        marker_offsets_map = {}
        for k in fitMarkers:
            v = fitMarkers[k]
            marker_offsets_map[k] = (v[0].getName(), v[1])
        # del fitMarkers

@peabody124
Copy link
Contributor Author

@keenon I found another related bug:

skeleton = nimble.RajagopalHumanBodyModel().skeleton

for b in skeleton.getBodyNodes():
    n = b.getNumShapeNodes()
    for i in range(n):
        s = b.getShapeNode(i)
        name = s.getName()
        print(i, name)
        shape = s.getShape()
        print("shape", shape)
        scale = shape.getScale()
        print("scale", scale)

Also crashes after a few nodes. It seems to crash after getting the name before printing the shape.

I'm wondering from extrapolating from your prior fix if a getShape needs to be defined here?
https://github.com/keenon/nimblephysics/pull/186/files#diff-a5184a1951f89b1f86b8b45fd5157f4bdb0be20ea8c62e8fefa74cc2c28371dcR110

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants