Galactic release #167

Closed
doisyg opened this issue May 27, 2020 · 69 comments

@doisyg
Contributor

doisyg commented May 27, 2020

It builds fine under Noetic from the melodic-devel branch, but at runtime I get:

terminate called after throwing an instance of 'pluginlib::LibraryLoadException'
what(): Failed to load library /home/ws/devel/lib//libspatio_temporal_voxel_layer.so. Make sure that you are calling the PLUGINLIB_EXPORT_CLASS macro in the library code, and that names are consistent between this macro and your XML. Error string: Could not load library (Poco exception = /lib/x86_64-linux-gnu/libjemalloc.so.2: cannot allocate memory in static TLS block)

Has anybody else tried this already and gotten a different result?

@SteveMacenski
Owner

I'm not seeing anything in the Noetic migration guide (http://wiki.ros.org/noetic/Migration) that makes me suspect pluginlib. The melodic-devel branch looks identical to the noetic-devel branch (https://github.com/ros/pluginlib).

Does the rospack plugins call detect it properly?

I'd be more than happy to release it once we work through these issues.

@SteveMacenski SteveMacenski self-assigned this May 27, 2020
@doisyg
Contributor Author

doisyg commented May 28, 2020

Yes:

$ rospack plugins --attrib=plugin costmap_2d 
spatio_temporal_voxel_layer /ws/src/spatio_temporal_voxel_layer/costmap_plugins.xml
costmap_2d /opt/ros/noetic/share/costmap_2d/costmap_plugins.xml

What's strange is that when preloading jemalloc with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2, the error disappears, but then it causes move_base to crash.
I will keep you posted if I have time to investigate.
Crash log (with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 before roslaunch):

[ WARN] [/move_base]: global_costmap: Pre-Hydro parameter "static_map" unused since "plugins" is provided
[ WARN] [/move_base]: global_costmap: Pre-Hydro parameter "map_type" unused since "plugins" is provided
[ INFO] [/move_base]: global_costmap: Using plugin "static_layer"
[ INFO] [/move_base]: Requesting the map...
[ INFO] [/move_base]: Resizing costmap to 200 X 200 at 0.050000 m/pix
[ INFO] [/move_base]: Received a 200 X 200 map at 0.050000 m/pix
[ INFO] [/move_base]: global_costmap: Using plugin "obstacle_layer"
[ INFO] [/move_base]:     Subscribed to Topics: laser_scan_sensor
[ INFO] [/move_base]: global_costmap: Using plugin "inflation_layer"
[ WARN] [/move_base]: local_costmap: Pre-Hydro parameter "static_map" unused since "plugins" is provided
[ WARN] [/move_base]: local_costmap: Pre-Hydro parameter "map_type" unused since "plugins" is provided
[ INFO] [/move_base]: local_costmap: Using plugin "static_layer"
[ INFO] [/move_base]: Requesting the map...
[ INFO] [/move_base]: Resizing static layer to 200 X 200 at 0.050000 m/pix
[ INFO] [/move_base]: Received a 200 X 200 map at 0.050000 m/pix
[ INFO] [/move_base]: local_costmap: Using plugin "obstacle_layer"
[ INFO] [/move_base]:     Subscribed to Topics: laser_scan_sensor
[ INFO] [/move_base]: local_costmap: Using plugin "rgbd_obstacle_layer"
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer being initialized as SpatioTemporalVoxelLayer!
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer's global frame is map.
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer loaded parameters from parameter server.
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer created underlying voxel grid.
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer initialization complete!
[move_base-3] process has died [pid 402449, exit code -11, cmd /opt/ros/noetic/lib/move_base/move_base __name:=move_base __log:=/home/gd/.ros/log/50b524f2-a0bf-11ea-9306-9d8c2f1ba3e5/move_base-3.log].
log file: /home/gd/.ros/log/50b524f2-a0bf-11ea-9306-9d8c2f1ba3e5/move_base-3*.log


@SteveMacenski
Owner

Running gdb to find where the crash happens would be useful. It might even be on the move_base side, if it crashes after the "initialization complete!" message.

@doisyg
Contributor Author

doisyg commented May 29, 2020

I assume you mean the second issue; it looks like the crash comes from the updateFootprint function:

[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer created underlying voxel grid.
[ INFO] [/move_base]: local_costmap/rgbd_obstacle_layer initialization complete!
[New Thread 0x7fffe65ce700 (LWP 69227)]

Thread 11 "move_base" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe65ce700 (LWP 69227)]
0x00007ffff1eaba97 in spatio_temporal_voxel_layer::SpatioTemporalVoxelLayer::updateFootprint(double, double, double, double*, double*, double*, double*) () from /ws/devel/lib//libspatio_temporal_voxel_layer.so

However, I don't know whether this second issue is linked to the first one or to its workaround (preloading jemalloc).

@doisyg
Contributor Author

doisyg commented May 29, 2020

If started with the STVL parameter enabled: false or update_footprint_enabled: false (and still preloading jemalloc), it doesn't crash.
Then, at runtime (changing the dynamic parameters):
  • enabled: true and update_footprint_enabled: false => no crash
  • enabled: true and update_footprint_enabled: true => crash

@SteveMacenski
Owner

What was the crash, if you had GDB up? What was the backtrace? Maybe your footprint parameter wasn't read in correctly, so one of the double pointers was null?

@SteveMacenski
Owner

SteveMacenski commented May 29, 2020

It would be good to know where the error is in this function https://github.com/SteveMacenski/spatio_temporal_voxel_layer/blob/melodic-devel/src/spatio_temporal_voxel_layer.cpp#L477-L496

(Some of those are costmap calls, so the error may actually be in costmap_2d. In fact, all of the functions called inside it are provided by costmap_2d.)

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

I don't fully understand why, but because the updateFootprint function was missing a return statement at the end, the for loop was running far beyond _transformed_footprint.size() and overflowing:

/*****************************************************************************/
bool SpatioTemporalVoxelLayer::updateFootprint(double robot_x, double robot_y,
                                               double robot_yaw, double* min_x,
                                               double* min_y, double* max_x,
                                               double* max_y)
/*****************************************************************************/
{
  // updates layer costmap to include footprint for clearing in voxel grid
  if (!_update_footprint_enabled)
  {
    return false;
  }
  costmap_2d::transformFootprint(robot_x, robot_y, robot_yaw, getFootprint(),
                                 _transformed_footprint);
  for (unsigned int i = 0; i < _transformed_footprint.size(); i++)
  {
    touch(_transformed_footprint[i].x, _transformed_footprint[i].y,
          min_x, min_y, max_x, max_y);
  }
  // BUG: no return statement here. Flowing off the end of a function that
  // returns bool is undefined behavior in C++.
}

Fixed in #168
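The snippet above is the version that crashed. Flowing off the end of a non-void function is undefined behavior in C++, and an optimizing compiler may assume that path is never taken, which is one plausible explanation for the loop running past the end of the vector (GCC's -Wreturn-type, enabled by -Wall, warns about it). Below is a minimal, self-contained sketch of the corrected shape, assuming the fix in #168 amounts to returning a value on every path. Point, transformFootprint and touch here are hypothetical stand-ins for the geometry_msgs and costmap_2d types and calls, not the real APIs.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for geometry_msgs::Point (only x and y are used here).
struct Point {
  double x;
  double y;
};

// Hypothetical stand-in for costmap_2d::transformFootprint(): rotates each
// footprint vertex by yaw and translates it to (x, y).
void transformFootprint(double x, double y, double yaw,
                        const std::vector<Point>& spec,
                        std::vector<Point>& out) {
  out.clear();
  const double c = std::cos(yaw);
  const double s = std::sin(yaw);
  for (const Point& p : spec) {
    out.push_back({x + p.x * c - p.y * s, y + p.x * s + p.y * c});
  }
}

// Hypothetical stand-in for a layer's touch(): grows the update bounds so
// that they include (x, y).
void touch(double x, double y, double* min_x, double* min_y,
           double* max_x, double* max_y) {
  *min_x = std::min(*min_x, x);
  *min_y = std::min(*min_y, y);
  *max_x = std::max(*max_x, x);
  *max_y = std::max(*max_y, y);
}

// Same control flow as updateFootprint above, but every path returns a value.
bool updateFootprint(bool update_footprint_enabled,
                     const std::vector<Point>& footprint,
                     double robot_x, double robot_y, double robot_yaw,
                     double* min_x, double* min_y,
                     double* max_x, double* max_y) {
  if (!update_footprint_enabled) {
    return false;
  }
  std::vector<Point> transformed;
  transformFootprint(robot_x, robot_y, robot_yaw, footprint, transformed);
  for (const Point& p : transformed) {
    touch(p.x, p.y, min_x, min_y, max_x, max_y);
  }
  return true;  // the statement the crashing version was missing
}
```

With a 1 m x 1 m square footprint centered on a robot at (1, 2) with zero yaw, the bounds grow to x in [0.5, 1.5] and y in [1.5, 2.5], and the function returns true; with the flag off it returns false without touching anything.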

@SteveMacenski
Owner

Weird. Is Noetic otherwise working?

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

Yes, provided that this line is added to .bashrc: export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
I don't think it can be released until this issue is solved. It apparently has to do with the new jemalloc version in Focal: jemalloc/jemalloc#1237

@SteveMacenski
Owner

Is that a pluginlib issue? Is there a ticket filed for it so someone knows it's an issue?

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

I found nothing else. How could you tell whether it is related to pluginlib or not?

@SteveMacenski
Owner

You said you're having issues loading plugins. Unless you're saying that this issue is unique to STVL, I have to assume it is the result of a pluginlib issue.
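One way to check whether pluginlib is involved at all is to load the plugin library directly with the dynamic loader and see whether the same error comes back. The helper below is a minimal sketch of that idea; tryLoadLibrary is a hypothetical name, and the dlopen flags are a choice made here, not necessarily the ones pluginlib/class_loader use. It returns an empty string on success, otherwise the dlerror() message. On older glibc versions it needs to be linked with -ldl.

```cpp
#include <dlfcn.h>

#include <string>

// Hypothetical diagnostic helper: load a shared library directly through the
// dynamic loader, bypassing pluginlib/class_loader entirely.
// Returns an empty string on success, or the loader's error message on failure.
std::string tryLoadLibrary(const std::string& path) {
  dlerror();  // clear any stale error state
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    const char* err = dlerror();
    return err != nullptr ? std::string(err) : std::string("unknown dlopen failure");
  }
  dlclose(handle);
  return std::string();
}
```

For example, calling tryLoadLibrary("/home/ws/devel/lib//libspatio_temporal_voxel_layer.so") from a small program, with and without LD_PRELOAD set, would show whether the "cannot allocate memory in static TLS block" message comes straight from the loader. A failure there points below pluginlib; a success does not fully rule it out, since the static TLS space left over depends on what else the process has already loaded.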

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

Sorry if I was unclear; the only thing I know for sure is that when I start move_base with STVL, I get this crash:

[ INFO] [/move_base]: local_costmap: Using plugin "obstacle_layer"
[ INFO] [/move_base]:     Subscribed to Topics: laser_scan_sensor
[ INFO] [/move_base]: local_costmap: Using plugin "sonar_layer"
[ INFO] [/move_base]: local_costmap/sonar_layer: ALL as input_sensor_type given
[ INFO] [/move_base]: RangeSensorLayer: subscribed to topic /sonar
[ INFO] [/move_base]: local_costmap: Using plugin "rgbd_obstacle_layer"
terminate called after throwing an instance of 'pluginlib::LibraryLoadException'
  what():  Failed to load library /home/gd/elodie1_ws/devel/lib//libspatio_temporal_voxel_layer.so. Make sure that you are calling the PLUGINLIB_EXPORT_CLASS macro in the library code, and that names are consistent between this macro and your XML. Error string: Could not load library (Poco exception = /lib/x86_64-linux-gnu/libjemalloc.so.2: cannot allocate memory in static TLS block)

Whereas without it, I don't have this issue (and other plugins such as costmap_2d::StaticLayer, costmap_2d::ObstacleLayer and range_sensor_layer::RangeSensorLayer load fine).
Whether it is an STVL or a pluginlib issue, I have no clue.
I dug a bit and found that the issue disappears when using export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 before launching move_base. That may help diagnose the issue for somebody familiar with dynamic libraries and memory allocation (which I am not).

@SteveMacenski
Owner

SteveMacenski commented Jun 1, 2020

What computer are you trying to run it on? Can you verify whether it works on another machine? I'm not sure what to do about that myself.

@SteveMacenski
Owner

I'd also be curious whether you run into issues loading NPVL or if it's just STVL: https://github.com/SteveMacenski/nonpersistent_voxel_layer

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

> What's the computer you're trying to run it on? Can you verify on another hardware machine it works or doesn't? I'm not sure what to do about that myself.

Same result on my 3-year-old ASUS laptop, a NUC8, and a NUC10.

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

> I'd also be curious if you ran into issues loading NPVL or if its just STVL https://github.com/SteveMacenski/nonpersistent_voxel_layer

I'll try.

@SteveMacenski
Owner

OK, so it's probably not platform-specific.

It may (?) be released, but rosdistro doesn't have a Focal entry for OpenVDB: https://github.com/ros/rosdistro/blob/master/rosdep/base.yaml#L3005

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

> I'd also be curious if you ran into issues loading NPVL or if its just STVL https://github.com/SteveMacenski/nonpersistent_voxel_layer
>
> I ll try

No issues with NPVL.

@SteveMacenski
Owner

Mhm, that is interesting then. I don't have a 20.04 machine yet to try to debug this. Hopefully in the next few weeks, but for the moment there's not much I can do.

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

> Ok, so probably not platform specific.
>
> It may (?) be released, but rosdistro doesn't have a Focal entry for openVDB https://github.com/ros/rosdistro/blob/master/rosdep/base.yaml#L3005

ros/rosdistro#25257

@doisyg
Contributor Author

doisyg commented Jun 1, 2020

> Mhm, that is interesting then. I don't have a 20.04 machine yet to try to debug this. Hopefully in the next few weeks but for the moment, there's not much I can do.

No rush for a release. I'll test it on a real robot, hopefully in the next few days, and report back if I find any other issues.

@SteveMacenski
Owner

OK, that issue is really odd though. I'd suggest filing a ticket for it on the pluginlib repo, since something broke during the last distribution update. As far as I can tell, nothing should need to be changed: http://wiki.ros.org/noetic/Migration

@sloretz

sloretz commented Jun 3, 2020

> Ok, that issue is really odd though, I'd suggest filing a ticket for it on pluginlib repo since something broke during the last distribution update. As far as Ican tell, nothing should need to be changed http://wiki.ros.org/noetic/Migration

It doesn't seem like this issue is coming from pluginlib. The libjemalloc2 dependency comes from the Debian package libopenvdb6.2, which STVL pulls in through these package.xml dependencies:

<run_depend>libopenvdb-dev</run_depend>
<run_depend>libopenvdb</run_depend>

https://packages.debian.org/sid/libopenvdb6.2
https://packages.ubuntu.com/focal/libopenvdb6.2

@doisyg pointed to an issue suggesting jemalloc is not being built with --disable-initial-exec-tls? It looks like there is a bug about this on the Debian bug tracker, so the Ubuntu Focal version probably has the same problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951704

Commenting there, or opening another bug on launchpad might be good next steps: https://launchpad.net/ubuntu/+source/jemalloc/+bugs
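A short illustration of the mechanism being discussed, as I understand it: with the initial-exec TLS model, a shared library's thread-local variables live in the static TLS block, which the dynamic loader sizes at process start, leaving only a small surplus for libraries loaded later with dlopen(). A library such as jemalloc built with initial-exec TLS can therefore fail to load late with "cannot allocate memory in static TLS block", while preloading it with LD_PRELOAD makes it part of the startup set, which would be consistent with the LD_PRELOAD workaround reported earlier in this thread. Building jemalloc with --disable-initial-exec-tls avoids that model. The snippet only shows what the two models look like in source, using the GCC/Clang __thread keyword and tls_model attribute; the variable and function names are made up for illustration.

```cpp
// GCC/Clang extension: force the initial-exec TLS model. In a shared library,
// this ties the variable to the static TLS block set up at process start.
static __thread int initial_exec_counter __attribute__((tls_model("initial-exec"))) = 0;

// No explicit model: in position-independent code (-fPIC) the compiler will
// usually pick a dynamic model (global-dynamic or local-dynamic), whose
// storage can be allocated after dlopen().
static __thread int default_model_counter = 0;

// Increments both per-thread counters and returns their sum for this thread.
int bumpCounters() {
  ++initial_exec_counter;
  ++default_model_counter;
  return initial_exec_counter + default_model_counter;
}
```

Both variables behave identically from the program's point of view; the difference is only in where the loader has to find room for them, which is why the failure shows up at load time rather than as a logic error.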

@SteveMacenski
Owner

Wouldn't this be something to file with the OpenVDB maintainers on GitHub, since they just need to change some flags and rebuild the Debian packages?

I'm not very encouraged by the likelihood of this being fixed, given that the ticket hasn't seen any activity since February.

@SteveMacenski
Owner

I filed the ticket above on OpenVDB to make them aware. Is there any other action here I can take?

@doisyg
Contributor Author

doisyg commented Jun 9, 2020

Thanks for filing the ticket. I reported there that the issue disappears when installing OpenVDB 7.0 instead of OpenVDB 6.2 before recompiling. The problem now is that OpenVDB 7.0 is not officially available on Focal.

@SteveMacenski SteveMacenski added the wontfix This will not be worked on label Apr 12, 2021
@SteveMacenski SteveMacenski changed the title Noetic / Foxy release ? Noetic / Foxy / Galactic release ? May 26, 2021
miltzhaw added a commit to icclab/icclab_summit_xl that referenced this issue Jun 11, 2021
miltzhaw pushed a commit to icclab/rosdocked-irlab that referenced this issue Jun 11, 2021
@HappySamuel

Hi,

Is there any Noetic release of STVL (installable via apt)?

Best,
Samuel

@doisyg
Contributor Author

doisyg commented Dec 1, 2021

It looks like it is finally going to get resolved for Ubuntu: https://bugs.launchpad.net/ubuntu/+source/openvdb/+bug/1882998
I will test and report here if nobody does it first.

@SteveMacenski
Owner

Freakin' awesome! I can release binaries once this is out. Let me know! I thought I had filed a ticket on that too and was roundly denied (though honestly I hadn't used Launchpad before, and I'm sure I didn't follow convention).

@SteveMacenski
Owner

SteveMacenski commented Dec 1, 2021

Funnily enough, I was just thinking about updating Nav2 to use OpenVDB as the default voxel grid implementation so that we can support robots of arbitrary height, rather than being limited to 16 voxels in height. I'm still not entirely sure about the memory footprint, but it would be worth a try. In the early days of STVL I had actually used OpenVDB's raytracing and replicated behavior like the Voxel Layer, though it wasn't any faster since this doesn't make use of signed distance fields (and it still had the discrete raycasting "ghost" voxels issue that was the root of what I was trying to solve).

If we can get binaries out and working, and 22.04 has the fix too, then I'd be a lot more confident in starting in that direction. I see no reason why Nav2 should have its own voxel_grid implementation; there have got to be plenty of standard library options for that these days. OpenVDB is the one I know well, so I'd start with that unless there are other reasonable alternatives.
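For context on the 16-voxel limit: as I understand it, costmap_2d's voxel_grid packs each (x, y) column into a single 32-bit word, with 16 bits recording marked voxels and 16 bits recording unknown ones, so a column cannot be taller than 16 cells. The class below is a hypothetical sketch of that kind of bit-packed column; the names and exact bit layout are assumptions, not the real voxel_grid code.

```cpp
#include <cstdint>

// Hypothetical bit-packed voxel column: one 32-bit word per (x, y) cell.
// Low 16 bits: "unknown" flags. High 16 bits: "marked" (occupied) flags.
class VoxelColumn {
 public:
  static constexpr unsigned kMaxHeight = 16;  // 32 bits / 2 flags per voxel

  // Every voxel starts out unknown and unmarked.
  VoxelColumn() : bits_(0x0000FFFFu) {}

  // Marks voxel z as occupied (and therefore known). Returns false when z does
  // not fit in the word, which is where the height limit comes from.
  bool mark(unsigned z) {
    if (z >= kMaxHeight) {
      return false;
    }
    bits_ |= (1u << (z + 16));  // set the marked flag
    bits_ &= ~(1u << z);        // clear the unknown flag
    return true;
  }

  bool isMarked(unsigned z) const {
    return z < kMaxHeight && ((bits_ >> (z + 16)) & 1u) != 0;
  }

  bool isUnknown(unsigned z) const {
    return z < kMaxHeight && ((bits_ >> z) & 1u) != 0;
  }

 private:
  std::uint32_t bits_;
};
```

The compactness is the appeal of this layout: a whole column is updated with a couple of bitwise operations. Going taller means either wider words per column or a sparse structure such as an OpenVDB grid.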

@doisyg
Contributor Author

doisyg commented Dec 5, 2021

The Focal "proposed" update package fixes the issue! I reported it, and it should be pushed to -updates next (I will report here when it is).

> Funny enough, I was just thinking about updating Nav2 to use OpenVDB as the default voxel grid implementation so that we can support arbitrary heighted robots, not being limited by 16 voxel heights. I'm still not entirely sure on the memory footprint, but would be worth a try. In the early days of STVL I had actually used OpenVDB's raytracing and replicated behavior like the Voxel Layer -- though wasn't any faster since this doesn't make use of signed distance fields (and still had the discrete raycasting "ghost" voxels issue that was the root of what I was trying to solve).
>
> If we can get binaries out and working + 22.04 has the fix too, then I'd be alot more confident in starting in that direction. I see no reason why Nav2 should have its own voxel_grid implementation - there's got to be a bunch of standard library options for that these days. OpenVDB is the one I know of well so I'd start with that, unless there's other reasonable alternatives.

Great news! OpenVDB still seems to be the best choice, though @facontidavide is working on something even more efficient: https://github.com/facontidavide/Bonxai

@SteveMacenski
Owner

So you just ran sudo apt install libopenvdb* and used those binaries alone? Can you confirm, @nickovaras? (Sorry, I'm not actually using this on anything right now, so it's difficult for me to test confidently.)

If so, I can release it ASAP this week.

Reopening, since we can now resolve this. I'll announce it on Discourse when it's released.

@SteveMacenski SteveMacenski reopened this Dec 6, 2021
@doisyg
Contributor Author

doisyg commented Dec 6, 2021

> So you did just sudo apt install libopenvdb* and used those binaries alone? Can you confirm @nickovaras (sorry, I'm actually not using this right now on anything so difficult for me to test confidently).

No, I had to enable "proposed" updates in Ubuntu: https://wiki.ubuntu.com/Testing/EnableProposed
Then sudo apt update and sudo apt install libopenvdb-dev.

> If so, I can release it ASAP this week.

You will be able to once the proposed package is pushed to regular updates (I guess in about 7 days). I'll let you know here when that's the case.

@SteveMacenski
Owner

Got it, let me know!

@doisyg
Contributor Author

doisyg commented Dec 8, 2021

libopenvdb-dev 6.2.1-8ubuntu1.1 released! (The previous version was 6.2.1-8ubuntu1.)

@SteveMacenski
Owner

SteveMacenski commented Dec 8, 2021

I assume that's verification that this resolves the issue with the new binaries? Now, @nickovaras / @nickvaras can you test using binaries? 😄

I'll be happy to release across the board once I have one more verification.

Steps

  • Verification
  • Release to Galactic, Foxy, Noetic
  • Announce on Discourse / LinkedIn / Slack w/ link to paper and repo

@SteveMacenski SteveMacenski removed the wontfix This will not be worked on label Dec 13, 2021
@SteveMacenski SteveMacenski changed the title Noetic / Foxy / Galactic release ? Galactic release Dec 15, 2021
@SteveMacenski
Owner

SteveMacenski commented Dec 15, 2021

The Galactic release is currently held up on adding OpenVDB rosdep keys for RHEL: ros/rosdistro#31488

@SteveMacenski
Owner

Galactic: ros/rosdistro#31490

@nickvaras

> I assume that's verification that this resolves the issue with the new binaries? Now, @nickovaras / @nickvaras can you test using binaries? 😄
>
> I'll be happy to release across the board once I have just another verification.
>
> Steps
>
> * [x]  Verification
> * [x]  Release to Galactic, Foxy, Noetic
> * [x]  Announce on Discourse / LinkedIn / Slack w/ link to paper and repo

The Noetic version seems happy with the new OpenVDB binaries.

@SteveMacenski
Owner

Thanks!

@tonynajjar
Contributor

tonynajjar commented Feb 9, 2022

Can anyone confirm that the Galactic binaries work? In my application, building STVL from the galactic branch works fine, but as soon as I switch to the binaries, it stops working. I know that's not much info; I haven't had time yet to reduce the setup to reproducible steps. Before I do so, it would be useful to know whether it already works for someone else, in which case it is most likely a local issue.

Maybe this is already useful: I get this log in a constant loop:

[planner_server-13] [DEBUG] [1644418859.299071301] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Received request for goal acceptance
[planner_server-13] [DEBUG] [1644418859.299331738] [agv2.planner_server_rclcpp_node.rclcpp_action]: Accepted goal 567fd88d1c72b477d7de99b5248bfb
[planner_server-13] [DEBUG] [1644418859.299459897] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Receiving a new goal
[planner_server-13] [DEBUG] [1644418859.299524779] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Executing goal asynchronously.
[planner_server-13] [DEBUG] [1644418859.299679267] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Executing the goal...
[planner_server-13] [DEBUG] [1644418859.300665702] [agv2.planner_server]: Attempting to a find path from (12.73, -90.02) to (14.21, -94.76).
[planner_server-13] [DEBUG] [1644418859.301190906] [agv2.planner_server]: Found valid path of size 36 to (14.21, -94.76)
[planner_server-13] [DEBUG] [1644418859.301430042] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Setting succeed on current goal.
[planner_server-13] [DEBUG] [1644418859.301680142] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Blocking processing of new goal handles.
[planner_server-13] [DEBUG] [1644418859.301794542] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Done processing available goals.
[planner_server-13] [DEBUG] [1644418859.301883938] [agv2.planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Worker thread done.
[controller_server-9] [DEBUG] [1644418859.319896006] [agv2.controller_server_rclcpp_node]: [follow_path] [ActionServer] Received request for goal acceptance
[controller_server-9] [DEBUG] [1644418859.320082481] [agv2.controller_server_rclcpp_node.rclcpp_action]: Accepted goal 95f59bfc6bfd285dc0332e5ba0ddddc
[controller_server-9] [DEBUG] [1644418859.320197859] [agv2.controller_server_rclcpp_node]: [follow_path] [ActionServer] Receiving a new goal
[controller_server-9] [DEBUG] [1644418859.320259808] [agv2.controller_server_rclcpp_node]: [follow_path] [ActionServer] An older goal is active, moving the new goal to a pending slot.
[controller_server-9] [DEBUG] [1644418859.320426379] [agv2.controller_server_rclcpp_node]: [follow_path] [ActionServer] The pending slot is occupied. The previous pending goal will be terminated and replaced.
[controller_server-9] [WARN] [1644418859.320496010] [agv2.controller_server_rclcpp_node]: [follow_path] [ActionServer] Aborting handle.

STVL does report being activated:

I 0:00:05:873 [agv2.local_costmap.local_costmap::activate] stvl_layer was activated.

The /voxel_grid topic is not published.

If this is not the right place to ask, I can create a separate issue.

@SteveMacenski
Owner

SteveMacenski commented Feb 9, 2022

Do you see any other errors, especially libjemalloc errors? The git history of the binaries and of the current galactic branch is identical, so the only thing I can think of is the build environment.

Those log messages don't seem related to me, unless there are other error messages you're not showing where something crashed or wasn't fully loaded.
