
Ditch databases? #172

Open
dionhaefner opened this issue Mar 31, 2020 · 16 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@dionhaefner
Collaborator

dionhaefner commented Mar 31, 2020

Currently, we have the following tables in a Terracotta database:

  • keys: Contains defined key names and their description
  • datasets: Maps key values to physical raster path
  • metadata: Maps key values to raster metadata (such as min and max value, bounds, footprint, ...)

An alternative model could be to save raster metadata on the rasters themselves. In that case, it would be much less likely for the raster metadata to go out of date. Having the metadata in a database only makes sense if we want to search it, which we currently don't allow.

Doing this would even allow us to decouple the database from Terracotta entirely. We could then have an external database that the frontend can query for valid rasters, and request them from Terracotta by filename. This gives users flexibility to have a searchable catalogue outside of TC, and we could recover the current behavior by running directory listings on the raster folder.
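Saving the metadata on the rasters themselves would presumably mean GeoTIFF tags, which can only hold strings. A minimal sketch of what encoding metadata into tag values could look like (the helper names are hypothetical; an actual write would go through something like rasterio's `update_tags`):

```python
import json

def encode_metadata_tags(metadata):
    """Serialize a metadata dict into flat string values, since GeoTIFF
    tags can only hold strings (hypothetical helper, not Terracotta API)."""
    return {key: json.dumps(value) for key, value in metadata.items()}

def decode_metadata_tags(tags):
    """Recover the metadata dict from stored tag strings."""
    return {key: json.loads(value) for key, value in tags.items()}

# Example metadata as described above: value range, bounds, ...
metadata = {
    "range": [0.0, 4095.0],
    "bounds": [10.0, 55.0, 11.0, 56.0],
}
tags = encode_metadata_tags(metadata)
assert decode_metadata_tags(tags) == metadata
```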

dionhaefner added the enhancement (New feature or request) and question (Further information is requested) labels on Mar 31, 2020
@j08lue
Collaborator

j08lue commented Mar 31, 2020

I am split on this:

  • Pro ditch: The database is what makes deployment cumbersome - you need to make sure you update the DB in sync with the data and manage access e.g. to a MySQL db. Just having to push files and get their names right is easier and would move all the keys logic to the regex / pattern matching schema.
  • Pro keep: I always find file names and folder structures an ugly implementation detail that I am quite happy Terracotta hides by default from the API user.

Having the metadata in a database only makes sense if we want to search it, which we currently don't allow.

We could also decide to strengthen the DB feature by adding query functionality. But then we open the box of people wanting to query very differently.

An alternative model could be to save raster metadata on the rasters themselves

Like in GeoTIFF tags? I assume you could still retrieve the metadata reasonably fast from the tags, and quickly enough build the path → metadata mapping for subsequent filtering, don't you think? It's completely parallelizable, at least.
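That parallel tag read could be sketched like this (the reader is stubbed out here; a real one would open each GeoTIFF, e.g. with rasterio, and read its tags):

```python
from concurrent.futures import ThreadPoolExecutor

def read_tags(path):
    # Stub: a real implementation would open the GeoTIFF at `path`
    # and return the metadata stored in its tags.
    return {"source": path}

def build_metadata_map(paths, max_workers=8):
    """Read tags from many rasters concurrently to build the
    path -> metadata mapping mentioned above."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(read_tags, paths)))
```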

@dionhaefner
Collaborator Author

Well, if you do it right, the user would never know they are looking at paths.

Before:

example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png

After:

example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png

Just have the folder structure

rasters/
├─ 20180101/
│  └─ 25XEL/
│     └─ B05.tif
├─ 20180102/
└─ 20180103/

And in my proposal, you would still have the option to use a (fully searchable) database, it just wouldn't contain the raster metadata by default. So you give either TC_DB_TYPE=filesystem or TC_DB_TYPE=sqlite or so.
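A sketch of how that proposed TC_DB_TYPE switch might dispatch (the driver classes here are illustrative placeholders, not existing Terracotta API):

```python
import os

class FilesystemDriver:
    """Resolves datasets by pattern matching on the raster folder."""

class SQLiteDriver:
    """Resolves datasets through a fully searchable SQLite database."""

DRIVERS = {"filesystem": FilesystemDriver, "sqlite": SQLiteDriver}

def get_driver():
    """Pick a driver from the (proposed) TC_DB_TYPE setting,
    defaulting to plain filesystem matching."""
    db_type = os.environ.get("TC_DB_TYPE", "filesystem")
    return DRIVERS[db_type]()
```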

Like in GeoTIFF tags?

Yes.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

An alternative model could be to save raster metadata on the rasters themselves. In that case, it would be much less likely for the raster metadata to go out of date.

This is a good idea in any case.

and we could recover the current behavior by running directory listings on the raster folder.

This would mean that keys would have to match the file/folder structure directly. Wouldn't that be a step backwards? I guess it could be good as a default behavior; then a DB is only needed if the user wants arbitrary key->raster mappings.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

But the DB would still be "managed" by Terracotta, right?
As in tc would still provide the tools for setting up a DB, adding keys and so on?

@dionhaefner
Collaborator Author

Yes, so you have pattern matching on the file system by default, and the option to use a database. Database could be anything that maps your search criteria to what we call keys now (and might call "path" or just "key" in a new version).

Example I have in mind: a STAC catalogue of all your rasters, you index that via SQL-like queries, and by that obtain the key you can use to query TC for the tiles.
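The last step of that flow, turning keys obtained from an external catalogue (e.g. a STAC search) into a tile request, might look like this (helper name hypothetical):

```python
def tile_url(base, keys, z, x, y):
    """Build a singleband tile URL from resolved key values.

    The key values would come from an external catalogue query;
    this just shows the final URL-building step of the flow above.
    """
    return f"{base}/singleband/" + "/".join(keys) + f"/{z}/{x}/{y}.png"

# Mirrors the URL shape from the earlier comment:
# example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png
url = tile_url("example.com", ["20180101", "25XEL", "B05"], 8, 139, 72)
```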

@dionhaefner
Collaborator Author

But the DB would still be "managed" by Terracotta, right?
As in tc would still provide the tools for setting up a DB, adding keys and so on?

I'm not sure. I think it would be good to do that for the most common DB usecase(s), but we should also have the option to have databases that are externally managed.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

I'm not sure. I think it would be good to do that for the most common DB usecase(s), but we should also have the option to have databases that are externally managed.

The reason I'm asking is that if we don't have these tools, users would then have to tinker with a database if they want the same functionality as now.
I imagine that people like using Terracotta because it's really easy to set up and use; I would be careful about messing with that too much.

@dionhaefner
Collaborator Author

Yea, we could still fully support e.g. the simple SQLite DB, and make any external DB read-only.

@dionhaefner
Collaborator Author

Not sure though in what way the SQLite DB would be easier to use than the filesystem. Ingest == dump the file into an S3 bucket sounds like the dream to me.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

Not sure though in what way the SQLite DB would be easier to use than the filesystem. Ingest == dump the file into an S3 bucket sounds like the dream to me.

Yes, it's very plausible that the people who would want the SQLite DB are a small minority, so it might be fully justified to make their lives a tiny bit harder, for the greater good.
But there could be people who don't want to / can't use S3 for some reason.

@dionhaefner
Collaborator Author

Well, the data has to live somewhere? S3 was just an example, could be any filesystem.

@dionhaefner
Collaborator Author

Aha, S3 could actually be an example where you would want to use a database. You can't run glob patterns on S3 buckets, so if you have tens of thousands of files it could become slow to filter those on /datasets queries.
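Since S3 only supports prefix listings, the glob would have to happen client-side. A sketch of that filtering step (one would obtain `object_keys` from e.g. a boto3 `list_objects_v2` paginator):

```python
import fnmatch

def filter_keys(object_keys, pattern):
    """Filter a flat list of S3 object keys against a glob-style pattern.

    Note: fnmatch's '*' also crosses '/', unlike a shell glob, so patterns
    may need to be stricter. This linear scan over every object is exactly
    what gets slow with tens of thousands of files.
    """
    return [key for key in object_keys if fnmatch.fnmatch(key, pattern)]
```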

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

Well, the data has to live somewhere? S3 was just an example, could be any filesystem.

Right, I'm just thinking out loud, trying to cover every angle from the user's perspective. Imagining how this would work from the perspective of the most unlikely user, who wants to have their rasters scattered all over the place 😄

But yeah, decoupling from the DB would probably make it better for the vast majority of users, at the possible cost of making it slightly worse for a small minority.

Computing the metadata would then happen during optimize-rasters I guess? That would make optimizing mandatory (unless the user wants lazy metadata computation). Perhaps optimize-rasters should then be prepare-rasters and the actual optimization can be turned off with a flag.

Aha, S3 could actually be an example where you would want to use a database. You can't run glob patterns on S3 buckets, so if you have tens of thousands of files it could become slow to filter those on /datasets queries.

Okay, that changes things a bit, I guess. Using Lambdas and S3 should not require any external DB tinkering, right?

@dionhaefner
Collaborator Author

dionhaefner commented Apr 1, 2020

Simplest mode of usage:

(only recommended for data exploration)

$ terracotta serve -r myrasters/{date}/{tile}/{band}.tif

In this case

  • /keys just gives [date, tile, band] with empty descriptions
  • /datasets runs glob with the given pattern and extracts / filters keys
  • /metadata opens the GTiff and reads its tags; if there are no tags, metadata is computed sloppily based on an overview
  • Metadata access triggers warning about missing tags and possibly missing cloud-optimization
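The glob-and-extract step for /datasets could work by turning the raster pattern into a regex with one named group per key. An illustrative sketch (not Terracotta's actual implementation):

```python
import re

def pattern_to_regex(pattern):
    """Turn a path pattern like 'myrasters/{date}/{tile}/{band}.tif'
    into a compiled regex with one named capture group per key."""
    escaped = re.escape(pattern)  # '{' and '}' come out as '\{' and '\}'
    regex = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>[^/]+)", escaped)
    return re.compile(regex + "$")

# Extract key values from a matching path:
rx = pattern_to_regex("myrasters/{date}/{tile}/{band}.tif")
match = rx.match("myrasters/20180101/25XEL/B05.tif")
assert match.groupdict() == {"date": "20180101", "tile": "25XEL", "band": "B05"}
```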

Slightly more advanced:

(recommended usage)

# This optimizes the files and dumps them into the S3 bucket when done
$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif
  • /keys reads descriptions from given JSON file
  • /datasets retrieves the file list of the entire bucket and filters it according to query parameters
  • /metadata reads from GTiff tags

Advanced:

$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ terracotta ingest s3://myrasters/{date}/{tile}/{band}.tif -o s3://myrasters/tc.sqlite
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -d s3://myrasters/tc.sqlite
  • /datasets now runs efficient queries on the SQLite database
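The ingest step could boil down to a single table with one column per key plus the raster path (mirroring today's datasets table). A sketch of the /datasets equality query against such a schema (schema and helper are illustrative):

```python
import sqlite3

KEYS = ("date", "tile", "band")  # known key names, fixed at ingest time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE datasets (date TEXT, tile TEXT, band TEXT, path TEXT)")
con.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("20180101", "25XEL", "B05", "s3://myrasters/20180101/25XEL/B05.tif"),
        ("20180102", "25XEL", "B04", "s3://myrasters/20180102/25XEL/B04.tif"),
    ],
)

def datasets(**filters):
    """Equality filtering on keys, as /datasets would run it."""
    assert all(key in KEYS for key in filters)  # guard against SQL injection
    where = " AND ".join(f"{key} = ?" for key in filters)
    query = "SELECT path FROM datasets" + (f" WHERE {where}" if where else "")
    return [row[0] for row in con.execute(query, tuple(filters.values()))]

assert datasets(tile="25XEL", band="B05") == ["s3://myrasters/20180101/25XEL/B05.tif"]
```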

Custom paths to rasters:

(keys are not coupled to file paths)

Same as before, but use Python API to create SQLite database

External database:

Option 1a

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external
  • /keys and /datasets are disabled (TC is just used for serving metadata and tiles)
  • User supplies an external discovery API that maps searchable parameters to Terracotta's keys (or just one ID key)
  • Discovery API could be a Terracotta plugin / related project

Option 1b

$ terracotta serve --external
  • /keys and /datasets are disabled (TC is just used for serving metadata and tiles)

  • User supplies an external discovery API that maps searchable parameters to file paths

  • All raster queries require full paths:

    $ curl example.com/metadata?path=s3://myrasters/20180101/25XEL/B05.tif

Option 2a

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456
  • Enforce some sort of database structure

  • Ingestion and creation could go through Terracotta

  • /keys returns searchable fields

  • /datasets does equality checks on searchable fields and returns keys used for raster retrieval:

    $ curl example.com/datasets?sensor=S2
    [{date: 20120101, band:B04, tile:25XEL}]  # note: no sensor here!

Option 2b

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456
  • Enforce some sort of database structure

  • Ingestion and creation could go through Terracotta

  • /keys returns searchable fields

  • /datasets supports SQL SELECT syntax:

    $ curl example.com/datasets?where="sensor == 'S2' AND tile == '25XEL' AND cloudcover<=95"
    [{date: 20120101, band:B04, tile:25XEL}] 

@dionhaefner
Collaborator Author

Options 2x look nice, but they are a bit half-baked, since there's no real way to browse the database...

@j08lue
Collaborator

j08lue commented Apr 1, 2020

so if you have tens of thousands of files it could become slow to filter those on /datasets queries

Slower than a database lookup for sure, but listing keys in a bucket and running some regex on them is not that slow either.
