In short, SLEAP can easily overload RAM when the array of tracks becomes large. In my case, it was trying to pin a 34 GB object to memory, which completely froze the system. This is particularly bad for long videos with noisy backgrounds, e.g., recording all day in a naturalistic environment (which is unfortunately the bread and butter of our lab). This has happened on both Ubuntu and Windows. I've run into this issue in other contexts in the past (see #1288), but the most recent occurrence is particularly bad because it completely locks up the system, requiring a hard reset. After some experimenting, I have found I can generally prevent this by limiting max_instances per frame, and looking back at the previous issues, I see that there is now a --tracking.max_tracks argument that should put a hard cap on the proliferation of tracks. Still, I think my suggestions below might be worthwhile, given how frustrating it is to have your whole computer freeze, especially if you're working on a remote server.
Expected behaviour
Ideally, I would expect SLEAP to (a) not need so much RAM that it freezes the system, and (b) if it does, raise a warning and adjust, or raise an error and exit, rather than crashing the whole computer.
If I understand correctly, SLEAP generates a dense array of tracks, so it can be very memory intensive for long videos with many tracklets. I understand there may be performance/dependency issues that make changing this difficult, but I wonder if it would be possible to use a sparse array to prevent the size from multiplying.
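To illustrate the dense-vs-sparse point, here is a rough back-of-the-envelope sketch (all numbers here are hypothetical: the track count, node count, and dtype are assumptions, not SLEAP's actual internals). A dense (frames, tracks, nodes, xy) array grows with the product of frames and tracks, while a sparse layout only pays for cells that actually contain an instance:

```python
# Hypothetical illustration: dense (frames, tracks, nodes, xy) float32 array
# vs. a sparse layout that stores only occupied (frame, track) cells.

def dense_bytes(n_frames, n_tracks, n_nodes, itemsize=4):
    """Bytes for a dense float32 array of shape (frames, tracks, nodes, 2)."""
    return n_frames * n_tracks * n_nodes * 2 * itemsize

def sparse_bytes(n_occupied, n_nodes, itemsize=4, key_overhead=16):
    """Rough bytes if only occupied (frame, track) cells store coordinates,
    with a small per-cell indexing overhead."""
    return n_occupied * (n_nodes * 2 * itemsize + key_overhead)

# 30 min at 25 fps, with tracklet proliferation into hundreds of tracks:
n_frames = 30 * 60 * 25   # 45,000 frames
n_tracks = 500            # runaway tracklets (hypothetical)
n_nodes = 13              # e.g., a 13-node skeleton (hypothetical)

dense = dense_bytes(n_frames, n_tracks, n_nodes)
# If at most ~8 instances actually exist per frame, most dense cells are empty:
sparse = sparse_bytes(n_occupied=n_frames * 8, n_nodes=n_nodes)

print(f"dense:  {dense / 1e9:.2f} GB")   # grows with frames * tracks
print(f"sparse: {sparse / 1e9:.3f} GB")  # grows with actual detections
```

Even at these modest hypothetical sizes the dense layout is ~50x larger, and the gap widens as spurious tracklets accumulate, since each new track adds a full empty column for every frame.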
Barring that, it would be useful to add some memory controls so that SLEAP can fail gracefully when it is about to overload the system (e.g., when attempting to allocate an object bigger than the available RAM). Resource management isn't something I understand very well, though, so this might not be feasible.
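A minimal sketch of what I mean by "fail gracefully", using only the standard library (the MemAvailable check is Linux-only, and the 80% safety threshold is an arbitrary assumption): before building a huge array, compare its projected size against available RAM and raise a clear error instead of letting the OS thrash.

```python
# Sketch: pre-allocation guard. Linux-only for the /proc/meminfo read;
# the safety_fraction threshold is an assumption, not a SLEAP setting.

def available_bytes():
    """Best-effort free-RAM estimate from /proc/meminfo (Linux only)."""
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # reported in kB
    except OSError:
        pass
    return None  # unknown platform; skip the check

def check_allocation(requested_bytes, avail=None, safety_fraction=0.8):
    """Raise MemoryError up front if the request would eat most of RAM."""
    if avail is None:
        avail = available_bytes()
    if avail is not None and requested_bytes > avail * safety_fraction:
        raise MemoryError(
            f"refusing to allocate {requested_bytes / 1e9:.1f} GB "
            f"with only {avail / 1e9:.1f} GB available"
        )
```

Something like this called before the big track-array allocation would have turned my hard freeze into a readable error message in the log.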
Actual behaviour
When running inference on a 30 min video (25 fps), my computer suddenly froze. Looking back at the log, this is what it reported before it stopped (there are more logs if you want them):
2023-12-12 21:06:18.445037: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -95 } dim { size: -96 } dim { size: -97 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -14 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -14 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 3060" frequency: 1867 num_cores: 28 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 2359296 shared_memory_size_per_multiprocessor: 102400 memory_size: 10033496064 bandwidth: 360048000 } outputs { dtype: DT_FLOAT shape { dim { size: -14 } dim { size: -98 } dim { size: -99 } dim { size: 1 } } }
2023-12-12 21:06:19.408565: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
2023-12-12 21:36:59.476568: E tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 34357641216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-12-12 21:36:59.477290: W ./tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34357641216
Screenshots
Here's a frame from the video that crashed it. Incidentally, this isn't even a video we needed to process; the fish had already been removed days before, but someone forgot to change the camera schedule. So you can see it's really a worst-case scenario, with many noisy, fish-like background detections.
How to reproduce
If you'd like, I can share the video and SLEAP models that caused this. Here is the command I ran (from within a Snakemake pipeline):
sleap-track -m {params.centered} -m {params.centroid} --peak_threshold 0.4 --tracking.tracker simple --tracking.similarity centroid --tracking.track_window 5 {input} -o snake/sleap/{wildcards.video}.predictions.slp 2>> {params.log};"
Since it happened, I've changed to setting tracking.target_instance_count to 8 (there are 4 fish, but I do some post-processing to filter out bad detections), and it hasn't failed with that on, although I think it theoretically still could if track assembly went badly. Last night I accidentally used the old command and froze my system again while working remotely, so I wrote this up while waiting for someone to get to the lab to reset it.
As always, I really appreciate everything all of you do to make this such an amazing package. Over the break we are set to process thousands of fish-days worth of data; thanks for making that possible.
I was able to at least prevent my computer from crashing by using ulimit -v 28000000. This was stricter than it needed to be (some jobs get killed by ulimit when they would have been able to run without eating all the RAM), but it at least kept my computer from freezing up unexpectedly. I still do not know how to run these in a way that produces useful output, though.
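For anyone who wants the same safety net without changing their shell session, the ulimit workaround can also be applied per-process from a Python wrapper via the standard library's resource module (Linux-only; the 28 GB cap just mirrors my ulimit value and is not a recommended setting):

```python
# Same effect as `ulimit -v 28000000` (ulimit counts in kB), but scoped to
# this process and its children. Linux-only; the cap value is an assumption.
import resource

def cap_virtual_memory(max_bytes):
    """Cap this process's address space so oversized allocations raise
    MemoryError inside Python instead of freezing the whole machine."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

cap_virtual_memory(28_000_000 * 1024)  # 28,000,000 kB, as in the ulimit call
```

With this in place, an oversized allocation dies with a Python MemoryError (and a usable traceback in the log) rather than a hard system freeze.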
I tried using --tracking.max_tracks, but that doesn't seem to work? I set max tracks to 20 but still got hundreds of tracks on a 2500-frame sample video.
For reference, here are the parameters used for max tracks:
Another update: after upgrading to the most recent version of SLEAP (1.3.3) and using the --tracking.tracker simplemaxtracks option, it now works properly and (presumably) will not overflow memory anymore. I'll add more updates if I find anything else important.
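In case it helps anyone else, this is a sketch of the working command, i.e., the original pipeline invocation with the tracker swapped to simplemaxtracks and a hard track cap added (the Snakemake placeholders are as in my rule above, and the cap of 8 is just my choice to match the target instance count; adjust for your animals):

```shell
# Original pipeline command with the tracker switched to simplemaxtracks and a
# hard cap on tracks; {params.*}, {input}, and {wildcards.*} are Snakemake
# placeholders, and the cap value of 8 is my own choice, not a default.
sleap-track \
  -m {params.centered} -m {params.centroid} \
  --peak_threshold 0.4 \
  --tracking.tracker simplemaxtracks \
  --tracking.max_tracks 8 \
  --tracking.similarity centroid \
  --tracking.track_window 5 \
  {input} -o snake/sleap/{wildcards.video}.predictions.slp
```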