
Geospatial ingestion - WIP #507

Merged: 28 commits merged into main from nb/geospatial-ingestion on Mar 19, 2024
Conversation

@normanb (Contributor) commented Jan 30, 2024

For sc-38757

@sgillies @ktsitsi

The get_metadata function groups the source files by destination raster block, which scales well for task graphs. For point clouds and geometries it returns the filtered input files. I checked the test results against expected outputs generated with GDAL, using QGIS.

@sgillies can we review the use of nodata in the raster ingest from rasterio? I am not able to override nodata for the destination raster, and this may need a change in the GDAL driver.

Geometries and point clouds are simpler and I will add these as a separate commit.

@sgillies (Contributor) left a comment

@normanb @ktsitsi I've got some comments, questions, and suggestions.

(Resolved review threads on src/tiledb/cloud/geospatial/helpers.py and src/tiledb/cloud/geospatial/ingestion.py)
@normanb (Contributor, Author) commented Jan 30, 2024

Thanks @sgillies, I will make these changes and run black before adding more support for point clouds and geometries.

@normanb (Contributor, Author) commented Feb 12, 2024

Items to do:

  • Add more tests
  • Use rasterio.merge exclusively rather than a WarpedVRT
  • Check that nodata is propagated for rasters
  • Geometries (need pyopener or similar)

Can I get a review of the new code before I add these items?

@ktsitsi (Contributor) left a comment

Looks good in general; just some minor comments to address.

(Resolved review threads on src/tiledb/cloud/geospatial/ingestion.py and src/tiledb/cloud/utilities/_common.py)
@normanb (Contributor, Author) commented Feb 16, 2024

Ingestion of geometries requires an update to use a TileDB VSI opener (a parameter).

@thetorpedodog (Collaborator) left a comment

This change is looking pretty good so far. While I have quite a few comments here, overall it's well-written and fairly easy to follow, and I think these suggestions can make it even more so.

BATCH_SIZE = 10


@dataclass
Collaborator:

Prefer @attrs.define in place of @dataclass, since it gives us some nice features…

Contributor Author:

👍
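
A minimal sketch of the suggested swap (hedged: the class name and field are illustrative, not from the PR):

import attrs

@attrs.define
class IngestConfig:
    # attrs.define generates __init__/__repr__/__eq__ like @dataclass,
    # plus extras such as validators, converters, and slotted classes by default
    batch_size: int = 10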

Comment on lines 120 to 123
if mins is None:
    mins = hdr.mins
else:
    mins = [min(e) for e in zip(mins, hdr.mins)]
Collaborator:

This seems like it might be worth pulling into a function:

def _fold_in(oper, existing, new):
  if existing is None:
    return new
  return [oper(e) for e in zip(existing, new)]

Contributor Author:

👍

Comment on lines +46 to +51
minx: float
miny: float
maxx: float
maxy: float
minz: Optional[float] = None
maxz: Optional[float] = None
Collaborator:

Would it make sense to make these min and max: Point members (or maybe lo/hi to avoid colliding with builtins?), with a Point like:

@attrs.define
class Point:
  x: float
  y: float
  z: Optional[float] = None

  @classmethod
  def at(cls, xyz: Iterable[float]) -> Self:
    """Takes in a two- or three-element iterable of (x, y[, z]) coordinates."""
    itr = iter(xyz)
    return Point(next(itr), next(itr), next(itr, None))

?

Contributor Author:

The abstraction is the BoundingBox class; adding a Point class makes the code more complicated and doesn't reflect that these are bounds.

# common properties
extents: BoundingBox
crs: str = None
paths: Optional[Union[List[os.PathLike], os.PathLike]] = None
Collaborator:

I think it would make more sense to have this be a one-element list (or tuple) in the case of there being one entry, rather than an os.PathLike object itself.

Using attrs, this can be accomplished with a converter:

def _wrap_paths(paths: Optional[Union[Sequence[os.PathLike], os.PathLike]]) -> Optional[Tuple[os.PathLike, ...]]:
  if paths is None:
    return None
    # Would it make sense to return () here, so it's always a Tuple[PathLike, ...]?
  if isinstance(paths, (str, bytes)) or not isinstance(paths, Sequence):
    # Editor's note: the isinstance has to go after the str-or-bytes check
    # because str and bytes are both Sequences.
    # (Worse yet, str is a Sequence[str].)
    return (paths,)
  return tuple(paths)

# ...

@attrs.define
class GeoMetadata:
  # ...
  paths: Optional[Tuple[os.PathLike, ...]] = attrs.field(converter=_wrap_paths, default=None)

This gives an interface where you can pass either a single path or a collection of paths, but keeps the type that is stored consistent.

Contributor Author:

👍

def get_pointcloud_metadata(
    sources: Iterable[os.PathLike],
    *,
    config: Mapping[str, Any] = None,
Collaborator:

Since this is set to None, it should be optional (and as a small nit, when possible, use object instead of Any for “unspecified type”) so this can be config: Optional[Mapping[str, object]].

Contributor Author:

👍

    raise ValueError("Require at least one point cloud file to have been found")
elif dataset_type == DatasetType.RASTER:
    kwargs.update(
        {
Collaborator:

since the keys to this .update are all literals, this can be written as

kwargs.update(
    crs=meta.crs,
    ...
)

Contributor Author:

👍

"""

# Validate user input
if bool(search_uri) & bool(dataset_list_uri):
Collaborator:

Should this be if bool(search_uri) != bool(dataset_list_uri)? Otherwise, this will work fine if neither is provided.

Contributor Author:

This checks that search_uri and dataset_list_uri are not both provided. I will simplify this.
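
A minimal sketch of the simplified check (hedged: the error message is illustrative):

# reject the ambiguous case where both inputs are given
if search_uri and dataset_list_uri:
    raise ValueError("Provide either search_uri or dataset_list_uri, not both")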

T = TypeVar("T")


def batch(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
Collaborator:

To avoid the overloaded term batch, maybe chunked? Alternatively, we could directly use more_itertools.chunked.

Contributor Author:

👍
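
A minimal sketch of the rename (hedged: same behavior as the existing batch, assuming items supports len() and slicing; more_itertools.chunked would accept any iterable instead):

from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield successive chunk_size-sized slices of items."""
    for i in range(0, len(items), chunk_size):
        yield items[i : i + chunk_size]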

"""
if isinstance(filter, tiledb.Filter):
filter_dict = filter._attrs_()
filter_dict["_name"] = type(filter).__name__
Collaborator:

This is the same format we use in TileDB SOMA, except that we use the key _type rather than _name. Maybe useful to keep that consistent?

Contributor Author:

👍
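
A minimal sketch of the agreed key change (hedged: _attrs_ is the private tiledb.Filter method already used in the snippet above):

if isinstance(filter, tiledb.Filter):
    filter_dict = filter._attrs_()
    # "_type" matches the serialization convention used in TileDB SOMA
    filter_dict["_type"] = type(filter).__name__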

Collaborator:

Since these tests can’t run with the basic TileDB install, it might be better to put them in a geospatial directory for easy exclusion, and so that it’s obvious that the datafiles belong with the geospatial test.

Contributor Author:

👍

@normanb (Contributor, Author) commented Feb 20, 2024

Added dynamic node expansion from the result of a previous UDF when ingesting the data.

:param dataset_list_uri: URI with a list of dataset URIs, defaults to None
:param max_files: maximum number of URIs to read/find,
    defaults to None (no limit)
:param max_samples: maximum number of samples to ingest, defaults to None (no limit)
Member:

What's the difference between max_files and max_samples?

Contributor Author:

Removed; this was left over from copying the VCF template. Thanks for spotting.

defaults to None
:param config: config dictionary, defaults to None
:param namespace: TileDB-Cloud namespace, defaults to None
:param register_name: name to register the dataset with on TileDB Cloud,
Member:

What is the default if none are passed?

Contributor Author:

The destination array is not registered; I have updated the docstring.

).depends_on(process_node)

# Register the dataset on TileDB Cloud
if register_name:
Member:

Is there a reason we are not writing directly to a tiledb:// URI? There shouldn't need to be a register step, because we should create it through TileDB Cloud directly.

Contributor Author:

The dataset_uri can be a tiledb:// URI and it works. This way we can also write directly to object storage if needed. This follows the VCF ingestion, but I'm happy to restrict this to just tiledb:// URIs.

register_name=register_name,
config=config,
verbose=verbose,
trace=trace,
Member:

This function doesn't take the trace parameter

Contributor Author:

👍

config=config,
verbose=verbose,
trace=trace,
log_uri=log_uri,
Member:

This function doesn't take the log_uri parameter

Contributor Author:

👍

verbose=verbose,
trace=trace,
log_uri=log_uri,
access_credentials_name=acn,
Member:

This function doesn't take access_credentials_name; it likely should be added to support setting credentials_name if this function remains.

Contributor Author:

👍


# Register the dataset on TileDB Cloud
if register_name:
    register_dataset_udf(
Member:

If we want to keep this function, it should be done as part of the task graph. It is currently called locally in the caller's environment (and fails because there is no context set up), and it runs before the task graph does.

Contributor Author:

Yes, this should be part of the graph; I am fixing all of the comments above related to this function.
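
A minimal sketch of registration as a graph node (hedged: node and parameter names are illustrative, based on the surrounding snippets):

# submit registration as its own node so it runs on TileDB Cloud
# after ingestion, rather than locally before the graph starts
register_node = graph.submit(
    register_dataset_udf,
    dataset_uri,
    namespace=namespace,
    register_name=register_name,
    config=config,
)
register_node.depends_on(ingest_node)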

"""Groups input URIs into batches.
:param dataset_uri: dataset URI
:param dataset_type: dataset type, one of pointcloud, raster or geometry
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type),
Member:

This can be removed as it's not used in the function itself.

Contributor Author:

👍

input_list_node = graph.submit(
    build_inputs_udf,
    dataset_type=dataset_type,
    acn=acn,
Member:

This should be access_credentials_name=acn so the batch task graph is set up correctly.

Contributor Author:

👍

sources = find(
    search_uri,
    config=config,
    excludes=ignore,
Member:

Typo: should be exclude?

Contributor Author:

👍

    search_uri,
    config=config,
    excludes=ignore,
    includes=pattern if pattern else fns[dataset_type]["pattern_fn"],
Member:

Typo: should be include?

Contributor Author:

👍

    config=config,
    excludes=ignore,
    includes=pattern if pattern else fns[dataset_type]["pattern_fn"],
    max_files=max_files,
Member:

I believe the parameter to find is max_count?

Contributor Author:

Yes, it is; I missed this change when writing the find function.

    continue

if vfs.is_dir(f):
    yield list_files(
Member:

This should be yield from and I believe we want to call find not list_files to recurse into the sub-folders. This means we also need to pass all the parameters, config=config, max_count=max_count.

Contributor Author:

I will add a test case. I kept list_files rather than find so that the current count check stays in the outer function, and I removed the extraneous arguments from the list_files call.
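
For reference, a minimal sketch of the yield from variant the reviewer suggested (hedged: parameter names follow the comment above):

if vfs.is_dir(f):
    # recurse into sub-folders, forwarding config and the count limit
    yield from find(f, config=config, max_count=max_count)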

offsets = _fold_in(min, offsets, hdr.offsets)

return GeoMetadata(
    paths=sources,
Member:

This needs to be list(sources)

@Shelnutt2 (Member) commented Mar 4, 2024:

Actually, we should convert sources to a list before we consume the generator. Something like:

paths = list(sources)
for f in paths:
    ...
return GeoMetadata(
    paths=paths,
    ...
)

Contributor Author:

👍

    raise ValueError(f"No {dataset_type.name} datasets found")

meta_kwargs = fns[dataset_type]["kwargs"]
meta = fns[dataset_type]["meta_fn"](sources, config=config, **meta_kwargs)
Member:

In testing this on some larger-scale use cases, getting the metadata inline is a problem. I believe we should batch the output into jobs of 10 or 100 files and grab the metadata in parallel in another stage of the task graph. I've seen it take over 30 minutes when there are 6000 files.

Contributor Author:

I think we should change this after this PR has been merged. There are a couple of scalability improvements we can make.

@normanb (Contributor, Author) commented Mar 5, 2024:

Also just noting that geospatial datasets tend to be larger files rather than thousands of small files. We should anticipate the problem, but I would like to make this change as an incremental PR.

Member:

Sure thing, thanks!
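
A rough sketch of the suggested follow-up (hedged: assumes the chunked helper discussed above; a downstream node would merge the partial GeoMetadata results):

# fetch metadata in parallel graph stages instead of inline
meta_nodes = [
    graph.submit(fns[dataset_type]["meta_fn"], chunk, config=config, **meta_kwargs)
    for chunk in chunked(sources, BATCH_SIZE)
]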

# schema creation node, returns a sequence of work items
ingest_node = graph.submit(
    fn,
    **input_list_node,
Member:

input_list_node is a Node type; you can't destructure the dictionary returned from input_list_node like this. The dictionary needs to be passed in intact and handled inside the function.

Contributor Author:

👍
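
A minimal sketch of the fix (hedged: the UDF signature and dictionary key are illustrative):

# inside the UDF, accept the resolved dictionary as a single argument
def fn(inputs: dict, **kwargs):
    dataset_uri = inputs["dataset_uri"]  # hypothetical key, for illustration
    ...

# pass the Node itself; the task graph resolves it to the dict at run time
ingest_node = graph.submit(fn, input_list_node)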

@normanb (Contributor, Author) commented Mar 8, 2024

The latest code has been tested with rasters and point clouds on TileDB Cloud.
pyopener is required for geometries in object storage and will be added in a separate PR, but the code for geometry ingestion is stubbed out.
This PR inlines the new utility functions; after it is merged, another PR will remove them and use the updated tiledb-cloud-py.

@thetorpedodog (Collaborator) left a comment

To set your tests up so that they only run when geospatial tests are called for:

Part 1 is already complete: you can use @pytest.mark.geospatial on your test cases that use geospatial code. This ensures that they don't run when not called for. However, they are still collected as part of the test discovery process, which means the test Python files themselves must be importable without the geospatial libraries installed.

I will add notes on the applicable files below, but don’t have time to do a full read-through of the codebase.

Comment on lines 54 to 62
test_1 = [
    self.fs.create_file("/var/data.dat"),
    self.fs.create_file("/var/data/xx1.txt"),
    self.fs.create_file("/var/data/xx2.txt"),
]

with mock.patch.object(VFS, "ls", return_value=test_1):
    with mock.patch.object(VFS, "is_dir", return_value=True) as mock_is_dir:
        mock_is_dir.side_effect = lambda f: self.fs.isdir(f.name)
Collaborator:

Though I did notice this while scrolling through: Instead of setting up a bunch of mocks and using a fake filesystem, it seems like it would be simpler to create a temporary directory:

with tempfile.TemporaryDirectory() as tmp_name:
    tmp = pathlib.Path(tmp_name)
    (tmp / "data.dat").write_text("test")
    data_dir = tmp / "data"
    data_dir.mkdir()
    (data_dir / "xx1.txt").write_text("test")
    ...

The temporary directory will be cleaned up at the end of the with-block, and there is no need to muck with TileDB internals.

Contributor Author:

I agree; for find this is easier than mocking the file system.

Comment on lines 11 to 15
import affine
import fiona
import numpy as np
import rasterio
import shapely
Collaborator:

You'll need to move the imports of these into only the methods where they are directly used, to avoid having them at the module level.

Comment on lines 18 to 27
from tiledb.cloud.geospatial import BoundingBox
from tiledb.cloud.geospatial import DatasetType
from tiledb.cloud.geospatial import build_inputs_udf
from tiledb.cloud.geospatial import ingest_geometry_udf
from tiledb.cloud.geospatial import ingest_point_cloud_udf
from tiledb.cloud.geospatial import ingest_raster_udf
from tiledb.cloud.geospatial import load_geometry_metadata
from tiledb.cloud.geospatial import load_pointcloud_metadata
from tiledb.cloud.geospatial import load_raster_metadata
from tiledb.cloud.geospatial import read_uris
Collaborator:

Likewise with this—importing tiledb.cloud.geospatial will end up importing those libraries by proxy, so within the function itself, import the items. To avoid all the repetition, I recommend importing the module rather than anything inside it, so…
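
A minimal sketch of the deferred module import (hedged: the test body is illustrative; BoundingBox fields follow the snippet earlier in this review):

def test_bounding_box(self):
    # import inside the test so collection doesn't require the geospatial extras
    from tiledb.cloud import geospatial

    bbox = geospatial.BoundingBox(minx=0.0, miny=0.0, maxx=1.0, maxy=1.0)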

# Ignore warnings
warnings.simplefilter("ignore")
# Create a temporary directory
self.test_dir = Path(tempfile.mkdtemp())
Collaborator:

Another sidenote:

Something like

self.tempdir_obj = tempfile.TemporaryDirectory()
self.test_dir = pathlib.Path(self.tempdir_obj.name)

can be used here, then…


def tearDown(self):
    # Remove the directory after the test
    shutil.rmtree(self.test_dir)
Collaborator:

self.tempdir_obj.cleanup() will do all the cleanup work for you.

self.test_dir = Path(tempfile.mkdtemp())

def create_test_geometries(tmp_path: os.PathLike):
    radius = 1.0
Collaborator:

add import shapely to the top of this function rather than having that import at the module level.

Additionally, since this doesn’t appear to depend upon anything in its closure (i.e., the only thing it uses is the tmp_path passed in as a parameter), I recommend moving it to the module level so that the setUp function isn’t so long. Likewise with similar functions.

json.dump(shapely.geometry.mapping(g), dst)

def create_test_rasters(tmp_path: os.PathLike):
    kwargs = {
Collaborator:

Likewise, at the top of this function, do import affine; import rasterio, etc.

This means that when this file is imported by pytest, it won’t immediately import any of the libraries, and by default all these tests will be skipped.

)


class GeospatialTest(unittest.TestCase):
Collaborator:

You can mark this test with @pytest.mark.geospatial, then
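
A minimal sketch of the marking (hedged: assumes the geospatial marker is registered in the project's pytest configuration, as described above):

import unittest

import pytest

@pytest.mark.geospatial
class GeospatialTest(unittest.TestCase):
    ...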

@ihnorton requested a review from Shelnutt2 on March 11, 2024
@ktsitsi self-requested a review on March 11, 2024
@ktsitsi dismissed their stale review on March 11, 2024 (comments were resolved)

@normanb merged commit a76221e into main on Mar 19, 2024
17 checks passed
@normanb deleted the nb/geospatial-ingestion branch on March 19, 2024