Data leak using Ray Core and Ray Workflows #3

Closed
julietcohen opened this issue Oct 21, 2022 · 5 comments
Labels: bug (Something isn't working)

julietcohen commented Oct 21, 2022

Data leak

While executing the PDG workflow with Ray, @KastanDay has consistently been hitting a memory leak: memory usage on NCSA's Delta server climbs as the process runs until the job crashes. @robyngit suggested the issue might be in the rasterization step (vector -> raster).

Debugging approaches by Kastan and Robyn:

  1. Failed to get the memory debugger to print from within a remote() function (see the hedged sketch after this list)
  2. Ran the memory debugger on rasterize_vectors() without using Ray:
from pympler import tracker

tr = tracker.SummaryTracker()
tr.print_diff()
print(':point_up: start of rasterize_vectors')

# <run code here>

tr.print_diff()
print(':point_up: END of rasterize_vectors')
from pympler.classtracker import ClassTracker
import pdgraster

# config and staged_paths come from the surrounding PDG workflow
rasterizer = pdgraster.RasterTiler(config)

tracker = ClassTracker()
tracker.track_object(rasterizer, resolution_level=2)
tracker.track_class(pdgraster.RasterTiler)
tracker.create_snapshot()  # summary 1

# do the work
rasterizer.rasterize_vectors(staged_paths, make_parents=False)
tracker.create_snapshot()  # summary 2

# delete the class instance
del rasterizer
tracker.create_snapshot()  # summary 3

tracker.stats.print_summary()
  • result: no leak reported
  • tried resolution_level=3 (also tried 4, 5, and 6)
  3. Using a context manager with the GeoDataFrames
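
For item 1 above, here is a minimal, hedged sketch (not from the original debugging session) of one way to try getting pympler output from inside a Ray task. The profiled_task() function and its body are placeholders, and task stdout may end up in the Ray worker logs rather than the driver console, depending on Ray's logging configuration:

import ray
from pympler import tracker

ray.init(ignore_reinit_error=True)

@ray.remote
def profiled_task():
    # Hypothetical placeholder task: the tracker lives inside the worker process
    tr = tracker.SummaryTracker()
    tr.print_diff()  # baseline summary of objects in the worker
    # <run the work to be profiled here, e.g. rasterize_vectors()>
    tr.print_diff()  # objects allocated since the baseline

ray.get(profiled_task.remote())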
@robyngit robyngit added the bug Something isn't working label Oct 24, 2022
@robyngit (Member) commented:

In the Raster class, in order to calculate pixel values based on the area of a pixel that a polygon covers, we use the geopandas overlay method to "slice" each polygon with the grid lines (see lines 678-681).

I think that the overlay method could be producing a memory leak somehow. I made a minimal example where we create some row polygons, then create some column polygons, then overlay the two to create a grid. When we repeat this process many times, the memory usage starts to climb even though we are not storing the output (the grid GeoDataFrame) in any variable.

Here is the plot of memory usage over 200 runs:

[Plot: overlay_memory_leak]

This was created using the following python script:

from memory_profiler import profile
import geopandas as gpd
from shapely.geometry import box

# Number of row & column polygons to make. Will result in n x n cells after
# the overlay operation. The larger the number, the slower & more memory
# intensive the operation.
n = 100
# Number of times to repeat the same overlay process.
runs = 200

# Create n vertical rectangles (columns in a grid)
column_polygons = gpd.GeoDataFrame(
    geometry=[box(i, 0, i + 1, n) for i in range(n)])
# Create n horizontal rectangles (rows in a grid)
row_polygons = gpd.GeoDataFrame(
    geometry=[box(0, i, n, i + 1) for i in range(n)])


@profile
def suspected_leak():
    for i in range(0, runs):
        print(f'Run {i+1} of {runs}')
        # overlay the horizontal rectangles on the vertical rectangles,
        # resulting in n x n polygons (cells in a grid)
        column_polygons.overlay(row_polygons)


if __name__ == '__main__':
    suspected_leak()

To run the above code (saved as overlay_leak.py):

pip install memory-profiler geopandas
mprof run overlay_leak.py
mprof plot

Here is a plot from a slightly different version where we create a 500x500 grid (very slow) 10 times. The blue brackets indicate where the overlay operations start and stop:

[Plot: overlay_leak_n500-runs10]

I also tried tracking memory usage with pympler but I'm having a harder time interpreting the output.

For example, the following script ...

from pympler import tracker
import geopandas as gpd
from shapely.geometry import box


n = 200
runs = 5

column_polygons = gpd.GeoDataFrame(
    geometry=[box(i, 0, i + 1, n) for i in range(n)])
row_polygons = gpd.GeoDataFrame(
    geometry=[box(0, i, n, i + 1) for i in range(n)])


def suspected_leak():
    for i in range(0, runs):
        print(f'Run {i+1} of {runs}')
        column_polygons.overlay(row_polygons)


tr = tracker.SummaryTracker()
tr.print_diff()
print('\n👆 START of overlay operation\n\n')

suspected_leak()

tr.print_diff()
print('\n👆 END of overlay operation')

... gives the following output:

                                  types |   # objects |   total size
======================================= | =========== | ============
                                   list |        9650 |    829.05 KB
                                    str |        9650 |    694.05 KB
                                    int |        2487 |     68.01 KB
                                    set |          29 |      6.12 KB
                                weakref |          79 |      5.55 KB
             builtin_function_or_method |          48 |      3.38 KB
                                   dict |           6 |      1.22 KB
                      method_descriptor |          13 |    936     B
                                   type |           0 |    408     B
  pandas.core.arrays.numpy_.PandasArray |           2 |    144     B
                  function (store_info) |           1 |    136     B
  pandas.core.dtypes.dtypes.PandasDtype |           2 |     96     B
                     wrapper_descriptor |           1 |     72     B
                      member_descriptor |           1 |     64     B
                                 method |           1 |     64     B

👆 START of overlay operation


Run 1 of 5
Run 2 of 5
Run 3 of 5
Run 4 of 5
Run 5 of 5
                       types |   # objects |   total size
============================ | =========== | ============
                        code |           0 |      5.95 KB
               numpy.ndarray |           2 |      3.33 KB
                         str |          23 |      2.34 KB
                     StgDict |           2 |      1.16 KB
        _ctypes.PyCArrayType |           1 |      1.04 KB
      _ctypes.PyCPointerType |           1 |      1.04 KB
                       tuple |          10 |    664     B
                     weakref |           9 |    648     B
  builtin_function_or_method |           9 |    648     B
                        type |           0 |    496     B
                        list |           0 |    424     B
           getset_descriptor |           4 |    256     B
           weakcallableproxy |           1 |     72     B
                       bytes |           1 |     67     B
         _ctypes.DictRemover |           1 |     32     B

👆 END of overlay operation

The only potentially related issue I could find in geopandas is geopandas/geopandas#955, though that one is about spatial joins (sjoin), not overlays.

@robyngit robyngit self-assigned this Nov 2, 2022

robyngit commented Nov 2, 2022

Update!

TLDR

Installing the newest PyGEOS and upgrading Shapely to a compatible version resolves at least one memory leak in the workflow:

python -m pip install pygeos
python -m pip install --upgrade shapely

The long version

After a few dead ends (running malloc_trim after overlay operations, looking into alternative memory allocators like jemalloc, looking for memory leaks in pandas (on top of which GeoPandas is built), considering dask-geopandas instead of GeoPandas, etc.), I started picking apart the code GeoPandas uses for the intersection overlay method. I tracked the source of the problem to this line. Commenting out this step completely flattened the memory profile plot when repeating overlay operations. The buffer(0) call in that line is used to correct invalid polygons created by the intersections.
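
As a side note (an addition, not part of the original comment), here is a tiny illustration of why the buffer(0) step exists: a self-intersecting "bowtie" polygon, the kind of invalid geometry an intersection can produce, becomes valid after buffering with distance 0.

from shapely.geometry import Polygon

# Self-intersecting "bowtie" ring: the two diagonal edges cross at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)       # False

repaired = bowtie.buffer(0)  # rebuilds the geometry as a valid polygon
print(repaired.is_valid)     # True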

It turns out that just calling buffer on a GeoSeries repeatedly produces the same ever-increasing memory usage pattern.

Minimal example
from pympler import tracker
import geopandas as gpd
from shapely.geometry import box

tr = tracker.SummaryTracker()


def suspected_leak(n, runs):
    gdf = gpd.GeoDataFrame(
        geometry=[box(i, 0, i + 1, n) for i in range(n)])

    for i in range(0, runs):
        print(f'Run {i+1} of {runs}')
        gdf.geometry.buffer(0)


def main():
    n = 200
    runs = 20000

    print('\n👇 START of buffer operation\n\n')

    tr.print_diff()
    suspected_leak(n, runs)
    tr.print_diff()

    print('\n👆 END of buffer operation')


main()

Digging further into the buffer method revealed that GeoPandas uses different underlying libraries depending on whether Shapely or PyGEOS is available. The preference is Shapely 2.0+ (in beta), then PyGEOS, then the older Shapely method. Checking my pip freeze output, I didn't have PyGEOS installed. Installing it (version 0.13) gave a warning about an incompatibility between my PyGEOS and Shapely versions, so I upgraded Shapely to the latest release, 1.8.5.
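
For reference (an addition, not part of the original comment), in the GeoPandas versions current at the time you can check which backend is actually in use via GeoPandas' own options and version report:

import geopandas as gpd

# True when GeoPandas dispatches geometry operations to PyGEOS
print(gpd.options.use_pygeos)

# Reports GeoPandas, Shapely, PyGEOS, and GEOS versions, among others
gpd.show_versions()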

With these new library updates, repeating the overlay method stopped causing a memory leak. Overlay also seems to be faster! The leak must originate in the older Shapely code path that GeoPandas uses to buffer geometries.

Apparently, PyGEOS and Shapely both call into the C++ library GEOS; PyGEOS was considered the newer, faster option until December 2021, when the two projects merged. Shapely 2.0 will eventually be released as the result of that merge.

Next steps

  • Test if this solves the memory leak: install PyGEOS, update Shapely, and re-run the rasterization step on Delta while monitoring the memory usage
  • Include PyGEOS 0.13+ & Shapely 1.8.5+ in setup.py (see the sketch below)
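
A minimal sketch of what that setup.py pin could look like; everything here except the two version specifiers is a placeholder, not the actual viz-raster or viz-staging setup.py:

# setup.py (sketch; name and other arguments are placeholders)
from setuptools import setup, find_packages

setup(
    name="pdgraster",
    packages=find_packages(),
    install_requires=[
        "pygeos>=0.13",
        "shapely>=1.8.5",
        # ... other existing dependencies ...
    ],
)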

robyngit added a commit to PermafrostDiscoveryGateway/viz-staging that referenced this issue Nov 3, 2022
robyngit added a commit to PermafrostDiscoveryGateway/viz-raster that referenced this issue Nov 3, 2022

robyngit commented Nov 3, 2022

The memory leak mystery continues! I tested rasterizing 250,000 staged files on Delta, on a single node, using the Ray workflow with the updated Shapely and PyGEOS libraries. With glances, I watched the memory usage start at around 10%, climb quickly when the workflow started, then increase steadily over about an hour until it reached 100% and the run crashed. By that point, at least 198,400 of the staged tiles had been rasterized. So, no improvement over the pre-update behavior! I'll keep looking into which part of the rasterization process is causing this leak.

robyngit added a commit to PermafrostDiscoveryGateway/viz-raster that referenced this issue Nov 29, 2022
@robyngit (Member) commented:

I believe I found the source of the memory leak for real this time: every time a tile was rasterized, I was opening a rasterio in-memory file without closing it... The issue seems obvious in retrospect... 🙃

This is fixed in PermafrostDiscoveryGateway/viz-raster@c56a8bc, but I'm going to do a larger run on Delta before merging to be certain that this resolves the leak.
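
For illustration only (the actual change is in the linked commit), the general pattern looks roughly like this; the raster size, profile, and data below are made up, and only the idea of an unclosed rasterio MemoryFile comes from the comment above:

import numpy as np
from rasterio.io import MemoryFile
from rasterio.transform import from_bounds

data = np.zeros((1, 256, 256), dtype='uint8')
profile = dict(
    driver='GTiff', height=256, width=256, count=1, dtype='uint8',
    crs='EPSG:4326', transform=from_bounds(0, 0, 1, 1, 256, 256))

# Leaky pattern: the MemoryFile is created but never closed, so its buffer
# stays alive after the dataset is written (once per rasterized tile)
memfile = MemoryFile()
with memfile.open(**profile) as dataset:
    dataset.write(data)
# memfile.close() is never called

# Fixed pattern: the context manager closes the MemoryFile when the block ends
with MemoryFile() as memfile:
    with memfile.open(**profile) as dataset:
        dataset.write(data)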


robyngit commented Dec 1, 2022

We are having trouble accessing Delta resources, and unfortunately I could not get a run going on datateam that used anywhere near 100% of memory. However, here is a little test that demonstrates why I believe the issue we initially saw is fixed.

Given the following script, run with mprof run test.py ...

# test.py
from pdgraster import RasterTiler

def measure_memory(n):
    tiler = RasterTiler({})
    # one example staged tile, rasterized repeatedly
    ex_path = 'staged/WorldCRS84Quad/13/1402/902.gpkg'
    for i in range(1, n + 1):
        print(f'Job {i} of {n}')
        tiler.rasterize_vector(ex_path)

if __name__ == '__main__':
    measure_memory(50)

... Here is the memory profile plot using the main branch version of viz-raster:

[Plot: main-50]

... and here it is using the bug-mem-leak branch version of viz-raster with the new fix:

[Plot: fix]

I am going to merge the fix into main and consider this issue resolved, but we can re-open if needed.

@robyngit robyngit closed this as completed Dec 1, 2022