Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VRT form of merging #2699

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

VRT form of merging #2699

wants to merge 7 commits into from

Conversation

sgillies
Copy link
Member

@sgillies sgillies commented Jan 2, 2023

There's a lot of overlap between rasterio's merge and GDAL's GDALBuildVRT utility. I believe having two different ways to do the same kind of thing is a step backwards for this project. As an alternative to #2573, which wraps GDALBuildVRT, I've reimplemented the GDAL utility in Python, building on existing code in _boundless_vrt_doc.

Advantages over using GDALBuildVRT:

  • It's compatible with Rasterio dataset objects. GDALBuildVRT only understands GDAL filenames.
  • It returns an XML string, making temporary files on the filesystem unnecessary.
  • It's Python. Easy to read and modify. GDALBuildVRT is C++. Harder to read and modify.
  • It shares code with _boundless_vrt_doc. We can continue to refactor and develop a better virtual dataset story throughout Rasterio.
  • It shares code with merge. The functions could be refactored so that merge uses virtual_merge, which also has the potential to resolve other issues in our tracker (like Improve performance of rio-merge #507 and [WIP] Tiled merge #2221).

The name of this function and the module it lives in are undecided.

This can totally replace gdal.BuildVRT for OSMnx (in #2573) and doesn't require wrapping GDALBuildVRT. Just because a Rasterio user wants something that GDALX does doesn't mean we need to literally expose GDALX from Rasterio. GDAL's API design can be improved upon and Rasterio is the project to do those improvements.

Resolves #2573

@sgillies sgillies added the vrt label Jan 2, 2023
@sgillies sgillies self-assigned this Jan 2, 2023
@sgillies
Copy link
Member Author

sgillies commented Jan 3, 2023

@snowman2 the code here is basically correct! The XML produced by virtual_merge(glob("tests/data/rgb?.tif")) becomes

test_virtual_merge

What do you think? I want Rasterio to aim high when it comes to API design. It's pragmatic to wrap gdalbuildvrt, but I think we can do better.

@snowman2
Copy link
Member

snowman2 commented Jan 3, 2023

@sgillies, your logic makes sense to me - thanks for working on this 👍

This was referenced Jan 3, 2023
rasterio/merge.py Outdated Show resolved Hide resolved
@sgillies sgillies marked this pull request as ready for review January 4, 2023 22:55
@sgillies sgillies marked this pull request as draft January 4, 2023 22:55
@gboeing
Copy link

gboeing commented Jan 5, 2023

This can totally replace gdal.BuildVRT for OSMnx (in #2573) and doesn't require wrapping GDALBuildVRT. Just because a Rasterio user wants something that GDAL.X does doesn't mean we need to literally expose GDAL.X from Rasterio. GDAL's API design can be improved upon and Rasterio is the project to do those improvements.

@sgillies that reasoning makes perfect sense. Thanks!

@sgillies
Copy link
Member Author

sgillies commented Jan 6, 2023

@gboeing in the meanwhile, you have the option of copying the new virtual_merge function into osmnx if you want. I don't imagine the interface or functionality will change much between now and 1.4.0.

@sgillies sgillies marked this pull request as ready for review January 6, 2023 17:01
@snowman2
Copy link
Member

snowman2 commented Jan 6, 2023

@sgillies, thoughts about adding more tests? Could start by copying some from #2698.

I am thinking it would be good to verify that the bounds/transform/data of the merged VRT are as expected. Also probably a good idea to test merging multiple files.

If rasters with different CRS are passed into this function, should it raise an exception? Should we test that?

Whether to adjust output image bounds so that pixel coordinates
are integer multiples of pixel size, matching the ``-tap``
options of GDAL utilities. Default: False.
dst_path : str or PathLike, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is dst_path used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is dst_kwds used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are left over from the original merge. TODO: remove them.

@sgillies
Copy link
Member Author

sgillies commented Jan 6, 2023

@snowman2 @groutr (adding you since you're a key merge contributor) if we have a virtual merging method like this one, what would you think about it returning a "Virtual Dataset" object instead of XML or an infoset?

It could be very lightweight, like our existing MemoryDataset. MemoryDataset is a thin wrapper around a "/vsimem/..." dataset. A virtual dataset could be a thin wrapper around an XML bytestring. The usage could be a lot like MemoryDataset's. I've got an additional method in mind: a VirtualDataset.materialize("example.tif") that would call GDALCopyCreate to make a non-virtual dataset.

I'll throw a quick example together.

rasterio/merge.py Outdated Show resolved Hide resolved
@groutr
Copy link
Contributor

groutr commented Jan 6, 2023

@snowman2 @groutr (adding you since you're a key merge contributor) if we have a virtual merging method like this one, what would you think about it returning a "Virtual Dataset" object instead of XML or an infoset?

It could be very lightweight, like our existing MemoryDataset. MemoryDataset is a thin wrapper around a "/vsimem/..." dataset. A virtual dataset could be a thin wrapper around an XML bytestring. The usage could be a lot like MemoryDataset's. I've got an additional method in mind: a VirtualDataset.materialize("example.tif") that would call GDALCopyCreate to make a non-virtual dataset.

I'll throw a quick example together.

I like the idea, though I would suggest choosing a method name that is easier to type like VirtualDataset.compute("example.tif") (to mirror the dask ecosystem terminology for realizing a result) or VirtualDataset.export("example.tif") or VirtualDataset.output("example.tif"). Another option that might fit well with rasterio is VirtualDataset.to_dataset("example.tif") (which seems like a very natural way to expect turning a virtual dataset into an actual dataset).

@snowman2
Copy link
Member

snowman2 commented Jan 6, 2023

A virtual dataset could be a thin wrapper around an XML bytestring.

I like that idea.

@groutr
Copy link
Contributor

groutr commented Jan 6, 2023

A virtual dataset could be a thin wrapper around an XML bytestring.

We might consider adding internal validation against the VRT schema (https://raw.githubusercontent.com/OSGeo/gdal/master/data/gdalvrt.xsd). xmlschema seems like a nice package for validation.

rasterio/vrt.py Outdated

"""
# To avoid a circular import.
from rasterio.io import MemoryFile
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to put VirtualDataset with the rest of the Dataset objects in rasterio.io?

rasterio/vrt.py Outdated Show resolved Hide resolved
rasterio/vrt.py Outdated Show resolved Hide resolved
rasterio/vrt.py Outdated

kwargs : dict
Optional dataset opening options to be passed to
rasterio.open().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rasterio.open().
:func:`rasterio.open`.

rasterio/vrt.py Outdated Show resolved Hide resolved
rasterio/vrt.py Outdated
from rasterio.io import MemoryFile

tree = ET.fromstring(text)
doc = ET.tostring(tree)
Copy link
Contributor

@groutr groutr Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding why we are parsing an xml string into an elementtree, only to write it back out to an xml string. Can we not use text as is?

Copy link
Member Author

@sgillies sgillies Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groutr I didn't have any intention other than 1) very basic validation (is it correct XML?) and 2) convert from bytes/unicode to just bytes.

I think we may well want to do schema validation. But for now, this is just a sketch.

@snowman2
Copy link
Member

@sgillies, I am thinking that it may make sense to move the VirtualDataset to a separate PR. I think that will help focus the discussion. Thoughts?

@groutr
Copy link
Contributor

groutr commented Jan 11, 2023

Is the goal of the VirtualDataset to replicate the functionality of a rasterio Dataset, permit construction and then be able to render a vrt xml or is it simply serving as a VRT wrapper for MemoryFile?

@sgillies sgillies added this to the 1.4.0 milestone Apr 27, 2023
else:

@contextmanager
def nullcontext(obj):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this could be confused with contextlib.nullcontext, I propose renaming this to something like:

Suggested change
def nullcontext(obj):
def dataset_opener(obj):

And removing dataset_opener = nullcontext below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that rasterio supports python 3.9+, why not use contextlib.nullcontext directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function defined here behaves differently from contextlib.nullcontext. nullcontext returns None whereas this function returns the file handle.

Copy link
Contributor

@groutr groutr Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contextlib.nullcontext will return whatever you pass as the argument (the default is None).
https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - then I agree with your suggestion to use contextlib.nullcontext.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groutr great suggestion! I will make it so.


ET.SubElement(vrtdataset, "SRS").text = crs_wkt if crs_wkt else ""
ET.SubElement(vrtdataset, "GeoTransform").text = ",".join(
[str(v) for v in output_transform.to_gdal()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work as well:

Suggested change
[str(v) for v in output_transform.to_gdal()]
str(v) for v in output_transform.to_gdal()

nodataval = nodata
else:
warnings.warn(
"The nodata value, %s, is beyond the valid "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The nodata value, %s, is beyond the valid "
f"The nodata value, {nodataval}, is beyond the valid "

else:
warnings.warn(
"The nodata value, %s, is beyond the valid "
"range of the chosen data type, %s. Consider overriding it "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"range of the chosen data type, %s. Consider overriding it "
f"range of the chosen data type, {dt}. Consider overriding it "

warnings.warn(
"The nodata value, %s, is beyond the valid "
"range of the chosen data type, %s. Consider overriding it "
"using the --nodata option for better results." % (nodataval, dt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"using the --nodata option for better results." % (nodataval, dt)
"using the --nodata option for better results."

@sgillies
Copy link
Member Author

Thanks for all the suggestions @groutr @snowman2 ! I'm going to accept them as soon as I'm back from my Wisconsin trail running vacation 😆

@sgillies sgillies removed this from the 1.4.0 milestone Sep 1, 2023
@sgillies
Copy link
Member Author

sgillies commented Sep 1, 2023

Deferring until 1.5.0. There's a lot to this and it's not ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add rasterio.vrt.BuildVRT
5 participants