Memory leak when looping through data variables of a dataset loaded from a VRT #774

Open
amaissen opened this issue May 3, 2024 · 2 comments
Labels
question Further information is requested

Comments


amaissen commented May 3, 2024

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def memory_leak():
  raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    # Load one band fully into memory, then release it
    data = raster[band].copy(deep=True).load()

    del data
    gc.collect()

Problem description

The allocated memory increases after each iteration.
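
For reference, a minimal sketch of how the growth can be observed, polling the process RSS with psutil after each iteration (psutil and the show_memory_growth helper are illustrative additions, not part of the original report):

import gc

import psutil
import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

def show_memory_growth():
    process = psutil.Process()
    raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})

    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
        # RSS keeps climbing between iterations instead of returning to a stable baseline
        print(f"{band}: RSS = {process.memory_info().rss / 1e6:.1f} MB")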

Expected Output

The memory is released after each iteration, so one can process multi-band datasets that do not fit in memory.

Environment Information

rioxarray (0.15.5) deps:
 rasterio: 1.3.10
   xarray: 2024.3.0
     GDAL: 3.8.4
     GEOS: 3.11.1
     PROJ: 9.3.1
PROJ DATA: /opt/conda/envs/some-env/share/proj
GDAL DATA: /opt/conda/envs/some-env/share/gdal

Other python deps:
    scipy: 1.13.0
   pyproj: 3.6.1

System:
   python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
executable: /opt/conda/envs/some-env/bin/python
  machine: Linux-5.15.0-101-generic-x86_64-with-glibc2.35

Conda environment information (if you installed with conda):


Environment (conda list):
gdal                      3.8.5           py310h3b926b6_2    conda-forge
libgdal                   3.8.5                hf9625ee_2    conda-forge
rasterio                  1.3.10                   pypi_0    pypi
rioxarray                 0.15.5                   pypi_0    pypi
xarray                    2024.3.0                 pypi_0    pypi

@amaissen amaissen added the bug Something isn't working label May 3, 2024
@snowman2 snowman2 added question Further information is requested and removed bug Something isn't working labels May 3, 2024

snowman2 commented May 3, 2024

Add these kwargs to open_rasterio to disable caching:

lock=False,  # disable internal caching
cache=False,  # don't keep data loaded in memory. pull from disk every time
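
For reference, a minimal sketch of the resulting call (reusing the PATH placeholder from the report above; the kwargs are exactly those suggested in this comment):

import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

raster = rxr.open_rasterio(
    PATH,
    band_as_variable=True,
    chunks={"x": -1, "y": -1},
    lock=False,   # disable internal caching
    cache=False,  # don't keep data loaded in memory, pull from disk every time
)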


amaissen commented May 3, 2024

@snowman2, thanks for pointing out these options. I tried the options you suggested, but they did not help release the memory.

However, when I store the entire raster to Zarr storage with to_zarr() and load it back with raster = xr.open_zarr(...), I don't see any memory leak when iterating through the data variables. This would look like:

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"
some_temp_dataset = "path_to_temp_zarr_store.zarr"  # placeholder path for a temporary Zarr store

def no_memory_leak():
  # Read from VRT and save to zarr (one chunk per band)
  rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(some_temp_dataset)
  
  # Open zarr and iterate over data vars.
  raster = xr.open_zarr(some_temp_dataset, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    data = raster[band].copy(deep=True).load()
  
    del data
    gc.collect()
