Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Virtual dataset #2712

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft

Virtual dataset #2712

wants to merge 3 commits into from

Conversation

sgillies
Copy link
Member

@sgillies sgillies commented Jan 11, 2023

Virtual datasets with deferred computation of data are one of GDAL's killer features. VRTs can represent rasters that are much larger than our computer's memory and can also layer transformations on top of non-virtual datasets (GeoTIFFs, etc).

How can Python developers make a VRT? The two most common options are very different: 1) call a high level GDAL API method like GDALBuildVRT (aka the gdalbuildvrt utility) or GDALCreateCopy (aka the gdal_translate utility), or 2) build up a VRT XML document using an XML library or by interpolating into an XML template. I think something in-between could be very useful. An API that leverages xml.etree (for example) and is easy to extend and modify, feels like existing Rasterio APIs, and glosses over details of the VRT XML schema.

An interesting thing about GDAL's data model is that deferred computation of VRT pixels is only an implementation detail. There are no "immediate" and "lazy" datasets, just datasets, some of which might be lazy. What if you wanted lazily evaluated rasters to be a type in your application? GDAL doesn't provide this. A rasterio virtual dataset class can.

Virtual datasets would support common operations like:

  • padding (like rasterio's boundless reads)
  • clipping
  • merging
  • stacking
  • layering of metadata and transformations

Virtual datasets would be composable, so that you could virtually stack virtual mosaics.

Ideally, a virtual dataset class would be compatible with Xarray and Dask. As a backend.

This PR contains a sketch of a virtual dataset class.

@snowman2
Copy link
Member

I had a thought that analogous to pyproj CRS building there could potentially be sub-elements for a VRTDataset represented as classes that could be used to build a VRTDataset.

@snowman2
Copy link
Member

@groutr
Copy link
Contributor

groutr commented Jan 12, 2023

I've been investigating my thoughts for a VirtualDataset over the last couple of days. @sgillies I hope I'm not misunderstanding the goal of such a class. I gather that you wanted a thin wrapper around an xml document, and that you also seem to want to use VRT for intermediate results inside rasterio which seems to imply the ability to construct/build virtual datasets from scratch.

I think @snowman2 has a good idea to represent the elements of a virtual dataset using classes. Doing this reminds me a lot of ORM libraries for relational databases.

A slightly different path came to my mind. Going along with the ORM theme, I wondered if maybe creating an abstract model of a dataset would be a good idea. VirtualDataset and the current Dataset could be derived from this abstract model and then implement the mapping between the abstract methods/properties and either the xml elements (for virtual datasets) or calls to gdal (for regular datasets). A VRT xml could then be rendered from this representation (relying on the respective implementations for abstract methods/properties). This approach would decouple a dataset representation from gdal implementation.

An incredibly crude illustration.

class AbstractDataset(ABC):    
    @property
    @abstractmethod
    def shape(self):
        pass

class GDALDatasetReader(AbstractDataset):
    def __init__(self, hds, x=0, y=0):
        self._hds = hds

    @property
    def shape(self):
        return (GDALGetRasterYSize(self._hds), GDALGetRasterXSize(self._hds))
    
class VirtualDataset(AbstractDataset):
    def __init__(self, x=0, y=0):
        self.shape = (y, x)

    @property
    def shape(self):
        return self._shape
    
    @shape.setter
    def shape(self, new_shape):
        self._shape = new_shape
        
def to_vrt_xml(dataset):
    #Export any concrete subclass of AbstractDataset to vrt xml
    tree = ET.ElementTree("VRTDataset", rasterXSize=str(dataset.shape[0]), rasterYSize=str(dataset.shape[1]))
    return tree.tostring()

These are just some thoughts. I like direction of the current approach too.

@sgillies
Copy link
Member Author

@groutr I added more details in the original comment. I'll read yours closely today.

@rouault
Copy link
Contributor

rouault commented Jan 13, 2023

GDAL doesn't provide this

Actually https://github.com/OSGeo/gdal/blob/master/frmts/vrt/gdal_vrt.h and https://github.com/OSGeo/gdal/blob/master/frmts/vrt/vrtdataset.h are public installed headers

@sgillies
Copy link
Member Author

@rouault thank you for the correction! Indeed, I am proposing something very much like VRTDataset (and GDALDataset), but simpler, more limited, only for VRT datasets (though it could be extended to MEM), and in Python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants