GDAL Error Reported in logs, but rasterio method never returns #3028
Hi @ryanherring, what about using timeout-decorator?
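One caveat with timeout approaches here: a signal-based timeout (which I believe is timeout-decorator's default mode) relies on Python getting a chance to run the signal handler, and that may never happen while the interpreter is blocked inside a C call such as a GDAL read. A minimal sketch of an alternative that uses only the standard library: run the read in a child process and let the parent enforce the deadline. The `run_in_child` name and its status strings are made up for illustration; they are not part of rasterio or timeout-decorator.

```python
import subprocess
import sys


def run_in_child(code: str, timeout_s: float) -> str:
    """Run `code` in a fresh Python process.

    Returns "ok", "failed" (non-zero exit status), or "timeout".
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising,
        # so a hung read cannot outlive the deadline.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical usage: the path and the 10-minute deadline are placeholders.
    read_code = "import rasterio; rasterio.open('/data/example.tif').read()"
    print(run_in_child(read_code, timeout_s=600))
```

A "timeout" result could then be treated like an out-of-memory failure, so that existing retry-with-more-memory logic gets a chance to run.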
In this scenario, I can think of two things you could combine. First, put custom memory monitoring in place, e.g. via psutil, which could catch more than just memory issues. Second, to reduce these cases or perhaps get rid of them entirely, read by blocks / windows.
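A rough sketch of how the two ideas might be combined, assuming the per-raster processing can be done one tile at a time. `iter_windows` and `process_by_windows` are hypothetical helper names, and the 1024-pixel tile size and 2 GiB budget are arbitrary placeholders. The rasterio and psutil imports are kept inside the function only so that the tiling helper has no dependencies.

```python
from typing import Iterator, Tuple


def iter_windows(height: int, width: int, block: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row_off, col_off, win_height, win_width) tiles covering the raster."""
    for row_off in range(0, height, block):
        for col_off in range(0, width, block):
            yield (
                row_off,
                col_off,
                min(block, height - row_off),  # edge tiles are smaller
                min(block, width - col_off),
            )


def process_by_windows(path: str, block: int = 1024, max_rss_bytes: int = 2 * 1024**3) -> None:
    """Read a raster tile by tile, checking this process's memory before each read."""
    import psutil
    import rasterio
    from rasterio.windows import Window

    proc = psutil.Process()
    with rasterio.open(path) as src:
        for row_off, col_off, h, w in iter_windows(src.height, src.width, block):
            rss = proc.memory_info().rss
            if rss > max_rss_bytes:
                # Raise in Python before asking GDAL for more memory.
                raise MemoryError(f"RSS {rss} bytes exceeds budget {max_rss_bytes}")
            data = src.read(window=Window(col_off, row_off, w, h))
            # `data` has shape (bands, h, w); process it here.
```

For a raster of shape (6, 5478, 4740), a 1024-pixel tile size gives 30 windows, each holding at most 6 x 1024 x 1024 float32 values (24 MiB) instead of the full array.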
@ryanherring so, when memory allocation fails in Python you get a Python exception, but when memory allocation fails in a GDAL routine there may be a deadlock? The GDAL project fixed a potential deadlock involving multi-threaded decoding of TIFFs as recently as last fall: OSGeo/gdal#8561. I don't think that one was necessarily related to memory. Without knowing the version of GDAL you're using or what format you're reading, it's hard to say more. It's worth searching GDAL's tracker for relevant open or closed issues.
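To find out where the process is actually stuck, the standard library's faulthandler module may help: `faulthandler.dump_traceback_later` is implemented with a watchdog thread, so it can still fire while the main thread is blocked in native code. The sketch below wraps it in a context manager; `hang_watchdog` is a hypothetical name and the 600-second deadline in the usage example is arbitrary. Note that the dump shows Python frames only, not the C stack inside GDAL.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s: float, exit_on_timeout: bool = True, file=sys.stderr):
    """Dump all Python thread tracebacks if the block runs longer than timeout_s.

    With exit_on_timeout=True the process also exits with status 1 after the
    dump, which turns an indefinite hang into an ordinary task failure.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=exit_on_timeout, file=file)
    try:
        yield
    finally:
        # The block finished in time (or raised), so disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Hypothetical usage around the read that hangs; the path is a placeholder.
    import rasterio

    with hang_watchdog(600):
        with rasterio.open("/data/example.tif") as src:
            data = src.read()
```

The traceback at least confirms which Python call never returned, which would be useful detail for a GDAL or rasterio issue report.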
@zstatmanweil 👋 the INFO level log message comes from here: https://github.com/rasterio/rasterio/blob/main/rasterio/_env.pyx#L85-L94. Note the comment: sometimes GDAL will call In our situation, GDAL tries to allocate memory here: https://github.com/OSGeo/gdal/blob/e187124a8462439d7a94581bc1cb062d9df51ac8/port/cpl_vsisimple.cpp#L1197. This fails and
I was hoping to be able to provide a more comprehensive way to reproduce the issue I'm seeing, but it's triggering so rarely in a remote environment that I have not been able to reproduce it on my laptop. I figured it was worth asking and seeing how I may be able to gather more info. The issue I'm seeing is that this block of code never returns (hangs indefinitely):
A log message is printed shortly after entering this block. After that, no other log messages are printed, the process never terminates, and no Python exception is thrown.
The raster file I'm trying to open has the shape (6, 5478, 4740) and the dtype is float32. This read is also taking place in a loop that reads in other rasters of a similar size, and this one is the 8th or so that's read in when the issue occurs.
I'm running this code over a very large number of rasters in a remote environment that uses Docker and runs in AWS. If I specify a small amount of memory (e.g. 4Gi) for the container, the call to read the data fails quickly with this exception:
When I run the same code with a container size of 16Gi, then I hit the issue described above (error message and task hangs until manually killed).
I want to make it clear that I understand that the size of 16Gi is insufficient to read all of the data into memory. If I set the memory higher, it succeeds. The issue is that I am only able to determine that the increase in memory is necessary by observing a hanging task and manually intervening. (Our team has code that automatically retries tasks with more memory if a task failed with an out of memory exception, but since the task is hanging, that logic is unable to kick in.)
If anyone has any thoughts on how to dig into this further, please let me know!