
calling trace_attributes continuously is gradually accumulating memory #4667

Closed
mkandulavm opened this issue Mar 31, 2024 · 23 comments

@mkandulavm

mkandulavm commented Mar 31, 2024

Hi

I am facing an issue where trace_attributes calls gradually increase the RAM used over time (for the same map region, each call made synchronously).
I have even set the auto_cleanup boolean to true for the actor object.
It seems the memory is not getting released somewhere and is being held by the object. Is this issue known, or has a similar instance been reported before?

Furthermore, when the calls are made asynchronously (where I use the parent actor's GraphReader for the cloned actors), I was getting a segmentation fault after some time if ENABLE_THREAD_SAFE_TILE_REF_COUNT is OFF.

When ENABLE_THREAD_SAFE_TILE_REF_COUNT is ON and I use more threads, the accumulation is much faster.

I am trying to find more information and will report back if possible.

Thank you.

@mkandulavm mkandulavm changed the title calling trace_attributes continously is gradually accumulating memory calling trace_attributes continuously is gradually accumulating memory Mar 31, 2024
@nilsnolde
Member

Might very well be related to #4322. But yeah, we've heard this before. Kevin suggests it's a tile cache clear omission, but there has been no time to look into that.

@mkandulavm
Author

mkandulavm commented Apr 4, 2024

Hi @nilsnolde

Thank you for the info.

I ran my test with a debug build + valgrind (massif). I am attaching the file for reference.

Here is my observation: the allocation at MapMatcher.cc (around line 880) is

const auto& candidates = candidatequery_.Query(measurement.lnglat(), measurement.stop_type(), sq_radius, costing());

According to the massif visualisation, the memory allocated by the above call is consistently increasing. Possibly some memory control is required at this place?

Kindly let me know if this helps so I can continue debugging.

massif.zip

@kevinkreiser
Member

Yep, that is the place I expected. There is a secondary cache there that we use to find location candidates for the input coordinates, and I suspect we just don't ever clear it. @mkandulavm give me a sec and I'll throw up a branch you can test to see if it fixes it.

@kevinkreiser
Member

OK, it looks like cache clearing is properly implemented both for the command line tool and for the regular worker path, see:

matcher_factory.ClearFullCache();

candidatequery_->Clear();

and in the grid cache itself:

void Clear() {
  grid_cache_.clear();
}
First question: if you set cache_size to something very small for this value in your valhalla.json:

'grid': {'size': 500, 'cache_size': 100240},

does it fix the problem for you? Perhaps you are using the default and your machine is just not well matched to it?
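
For reference, a minimal sketch of overriding that value programmatically before constructing the actor. It assumes the key lives at meili.grid.cache_size (as in the default valhalla.json) and that actor_t can be built from a boost::property_tree config; both assumptions should be checked against your Valhalla version:

#include <boost/property_tree/json_parser.hpp>
#include <boost/property_tree/ptree.hpp>

#include <valhalla/tyr/actor.h>

int main() {
  // Load the regular valhalla.json into a property tree.
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);

  // Shrink the candidate-grid cache bound. Note: this is an entry count, not bytes.
  config.put("meili.grid.cache_size", 100);

  // Build the actor from the modified config and issue trace_attributes calls with it.
  valhalla::tyr::actor_t actor(config, /*auto_cleanup=*/true);
  return 0;
}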

@mkandulavm
Author

I will try with lower numbers, but shouldn't cache_size be in bytes? It seems it's being checked against the map size.

@kevinkreiser
Member

Yeah, it uses a count to control the cache, so yes, the units are not bytes. The number of bytes of RAM used is a function of the number of items, but it's non-trivial to count the exact number of bytes this uses. There are many other caches that are controlled by item count rather than bytes. The code is as intended.
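
To illustrate why an item-count bound can still mean a lot of RAM, here is a toy count-bounded cache (purely illustrative, not Valhalla's implementation; the real grid cache's eviction policy may differ). Total memory is roughly the number of entries times the size of one entry, so large cached objects make even a modest count expensive:

#include <cstddef>
#include <unordered_map>

// Toy cache bounded by entry count rather than bytes.
template <typename Key, typename Value>
class CountBoundedCache {
public:
  explicit CountBoundedCache(std::size_t max_items) : max_items_(max_items) {}

  void Put(const Key& key, Value value) {
    // Once the item budget is reached, drop everything before inserting.
    // (A real cache might evict more selectively; the point is only that the
    // bound is a count, so RAM grows with per-entry size as well as count.)
    if (items_.count(key) == 0 && items_.size() >= max_items_) {
      items_.clear();
    }
    items_[key] = std::move(value);
  }

  const Value* Get(const Key& key) const {
    auto found = items_.find(key);
    return found == items_.end() ? nullptr : &found->second;
  }

private:
  std::size_t max_items_;
  std::unordered_map<Key, Value> items_;
};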

@mkandulavm
Author

mkandulavm commented Apr 4, 2024

OK. Just one more clarification: should I leave the grid size at 500? Will reducing it affect any computation?

I will run the test and share the massif capture to verify.

@kevinkreiser
Member

Reducing the grid size will use less RAM, but it will make map matching slower.

@mkandulavm
Author

The problem seems rectified on my end with grid.cache_size = 100.
I am attaching the massif file (both tests ran for ~3 hours).
I do understand that cache_size is meant to be an estimate of some average per-entry memory times the map size, but the current default seems pretty large.

massif-grid.cache_size_100.zip

@kevinkreiser
Member

Perhaps we inadvertently changed the size of the objects being cached there and thus invalidated the default. How much RAM does your system have? We typically run this on servers with 32 or 64 GB of RAM (in the cloud).

@mkandulavm
Author

mkandulavm commented Apr 5, 2024

I am using a server with 64 GB RAM and have a swap of ~250 GB.

To give a gross estimate, memory consumption was up to ~15 GB (this is per thread; I think with my 8 threads it was ~100 GB). Also, once the test was done, this memory was not released, possibly meaning that each map entry in the grid (and its associated references) represents a very large amount of memory.

After setting grid.cache_size to 100, RAM consumption did not exceed ~3-4 GB per thread.

All of the above was running only map matching (trace_attributes) calls.

Again, this is a very high-level idea. I can try writing some code to calculate the actual size if it helps, but it will take some time.

@nilsnolde
Member

After setting grid.cache_size to 100, RAM consumption did not exceed ~3-4 GB per thread.

Jeez, that's still an awful lot, and more than most standard machine configs with 2 GB/thread. Will that be filled over time or pre-allocated?

@mkandulavm
Author

I apologise, as I could not make the time/effort to estimate the memory.
The map matching itself is stable without any excess memory consumption.

Currently, I am experiencing a crash in routing after ~24 hours, with only this information:

[INFO] algorithm::bidirectional_a*
2024/05/13 08:47:39.904306 [INFO] algorithm::bidirectional_a*
double free or corruption (out)
Aborted (core dumped)

I have reset grid.cache_size to 100240 to see whether this has something to do with routing. Will update again and try to close the issue if it's not related.

Thank you.

@kevinkreiser
Member

A double free would be a bug, not usually an out-of-memory situation. Can you make a debug build and attach the stack trace here?

@mkandulavm
Author

Yes, making the build and updating our server. Once I get the core dump and a stack trace, I will share it.

@mkandulavm
Author

After using a debug build, I found the assert failing at this location in graphtile.h:

const NodeInfo* node(const GraphId& node) const {
  assert(node.Tile_Base() == header_->graphid().Tile_Base());

Currently we are preprocessing the OSM data and stripping out data not related to routing. This has sped up graph tile generation. I am wondering if this is causing the issue, but I can't pinpoint it.

Please note that the issue is not easily reproducible.

@kevinkreiser
Member

Currently we are preprocessing the OSM data and stripping out data not related to routing.

Yeah, that leaves a lot of room for imagination as to what could lead to the problem you are seeing. Basically, the problem is that there is an edge with an end node whose index is not found in your data. This could mean the node isn't there or that the edge's data is corrupted. It's hard to say.
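
To make the failing assert concrete, here is a rough sketch of the lookup path it guards, using the public GraphReader/GraphTile calls (simplified; not the exact code in loki):

#include <valhalla/baldr/directededge.h>
#include <valhalla/baldr/graphreader.h>
#include <valhalla/baldr/graphtile.h>
#include <valhalla/baldr/nodeinfo.h>

// Resolving a directed edge's end node goes through the tile addressed by the
// end node's GraphId. If the edge encodes a node id that is not actually in
// the data, the fetched tile's base id will not match the node id's base,
// which is what the assert in graphtile.h catches in a debug build.
const valhalla::baldr::NodeInfo* end_node_of(valhalla::baldr::GraphReader& reader,
                                             const valhalla::baldr::DirectedEdge& edge) {
  auto tile = reader.GetGraphTile(edge.endnode());
  if (!tile) {
    return nullptr; // the tile containing the end node is missing entirely
  }
  // In a debug build this asserts node.Tile_Base() == header_->graphid().Tile_Base().
  return tile->node(edge.endnode());
}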

@mkandulavm
Author

If the edges are corrupted, then the crash should be reproducible for the same request, right?
However, this is not happening.

It's possible the memory corruption happened somewhere else (in our own code), leading to the routing failure. This is typically the case with seg faults. I will update once I check. Thank you.

@mkandulavm
Author

mkandulavm commented May 16, 2024

Hi

I got a stack trace after long testing. The system becomes unresponsive and eventually crashes with a core dump (I was able to repeat this for 2 dumps and got similar stacks). The stack reflects the double free. This was run with graph tiles generated from unaltered Geofabrik OSM data.

#0  0x000055bfad3b27c5 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55c044932d50) at /usr/include/c++/10/bits/shared_ptr_base.h:158
#1  0x000055bfad3a84a7 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (this=0x7f15531c1f98, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#2  0x000055bfadbbfd76 in std::__shared_ptr<valhalla::baldr::GraphTile const, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0x7f15531c1f90, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#3  0x000055bfadbbfd92 in std::shared_ptr<valhalla::baldr::GraphTile const>::~shared_ptr (this=0x7f15531c1f90, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#4  0x000055bfade82031 in valhalla::loki::Reach::operator() (this=0x7f15531c2538, edge=0x7f1c19e9a408, edge_id=..., max_reach=50, reader=..., costing=..., direction=3 '\003') at /home/valhalla/src/loki/reach.cc:74
#5  0x000055bfadc58b39 in (anonymous namespace)::bin_handler_t::check_reachability (this=0x7f15531c24b0, begin=..., end=..., tile=..., edge=0x7f1c19e9a408, edge_id=...) at /home/valhalla/src/loki/search.cc:492
#6  0x000055bfadc59330 in (anonymous namespace)::bin_handler_t::handle_bin (this=0x7f15531c24b0, begin=..., end=...) at /home/valhalla/src/loki/search.cc:587
#7  0x000055bfadc59d45 in (anonymous namespace)::bin_handler_t::search (this=0x7f15531c24b0) at /home/valhalla/src/loki/search.cc:698
#8  0x000055bfadc5aa36 in valhalla::loki::Search (locations=..., reader=..., costing=...) at /home/valhalla/src/loki/search.cc:825
#9  0x000055bfadc54e0e in valhalla::loki::loki_worker_t::route (this=0x7f1524003420, request=...) at /home/valhalla/src/loki/route_action.cc:105
#10 0x000055bfada6831c in valhalla::tyr::actor_t::route(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()> const*, valhalla::Api*) (this=0x7f15240053a0, request_str=..., interrupt=0x7f15531c3060, api=0x7f15531c2d70) at /home/valhalla/src/tyr/actor.cc:103

@mkandulavm
Author

UPDATE: I think the source of the problem is that I set auto_cleanup to true by default when the actor_t object is created. To use routing multi-threaded, I am creating multiple actors with the parent actor's graph reader passed into them, but for all of these actors I was setting auto_cleanup to true.
Since reverting this to false there has been no crash (testing is ongoing...) and the behaviour of the system is stable. Will confirm by tomorrow.

@kevinkreiser
Member

If you want threads to share a GraphReader, you need to use the synchronized tile cache option for GraphReader. What is happening is that when you call cleanup on one of the actors, it removes things from the GraphReader cache; if another thread is currently using one of those tiles while its memory is destroyed, all kinds of crazy stuff can occur.
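
A rough sketch of that setup, assuming the synchronized cache is switched on via mjolnir.global_synchronized_cache and that actor_t has a constructor taking a config, a shared GraphReader, and the auto_cleanup flag (both assumptions should be verified against your Valhalla version):

#include <boost/property_tree/json_parser.hpp>
#include <boost/property_tree/ptree.hpp>

#include <valhalla/baldr/graphreader.h>
#include <valhalla/tyr/actor.h>

#include <memory>
#include <vector>

int main() {
  boost::property_tree::ptree config;
  boost::property_tree::read_json("valhalla.json", config);

  // Ask GraphReader for a thread-safe tile cache shared across threads.
  config.put("mjolnir.global_synchronized_cache", true);

  // One reader shared by every worker thread's actor.
  valhalla::baldr::GraphReader reader(config.get_child("mjolnir"));

  std::vector<std::unique_ptr<valhalla::tyr::actor_t>> actors;
  for (int i = 0; i < 8; ++i) {
    // auto_cleanup stays false so no actor evicts tiles another thread is using.
    actors.emplace_back(
        std::make_unique<valhalla::tyr::actor_t>(config, reader, /*auto_cleanup=*/false));
  }
  // ... hand one actor to each thread and call trace_attributes()/route() on it ...
  return 0;
}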

@mkandulavm
Author

OK, will check with this. It makes sense now, but it was not trivial to know. Maybe add a comment or an assert?

@mkandulavm
Author

If you want threads to share a GraphReader, you need to use the synchronized tile cache option for GraphReader. What is happening is that when you call cleanup on one of the actors, it removes things from the GraphReader cache; if another thread is currently using one of those tiles while its memory is destroyed, all kinds of crazy stuff can occur.

This was helpful. The crash issue is fixed.

I will close this thread, as I am no longer having any issues related to this or to the trace_attributes call.
