
Ditch databases? #172

Open
dionhaefner opened this issue Mar 31, 2020 · 16 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@dionhaefner
Collaborator

dionhaefner commented Mar 31, 2020

Currently, we have the following tables in a Terracotta database:

  • keys: Contains defined key names and their description
  • datasets: Maps key values to physical raster path
  • metadata: Maps key values to raster metadata (such as min and max value, bounds, footprint, ...)

An alternative model could be to save raster metadata on the rasters themselves. In that case, it would be much less likely for the raster metadata to go out of date. Having the metadata in a database only makes sense if we want to search it, which we currently don't allow.

Doing this would even allow us to decouple the database from Terracotta entirely. We could then have an external database that the frontend can query for valid rasters, and request them from Terracotta by filename. This gives users flexibility to have a searchable catalogue outside of TC, and we could recover the current behavior by running directory listings on the raster folder.
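Saving the metadata on the rasters themselves would presumably mean GeoTIFF tags, which can only hold strings. A minimal sketch of what encoding metadata into tag values could look like (the helper names are hypothetical; an actual write would go through something like rasterio's `update_tags`):

```python
import json

def encode_metadata_tags(metadata):
    """Serialize a metadata dict into flat string values, since GeoTIFF
    tags can only hold strings (hypothetical helper, not Terracotta API)."""
    return {key: json.dumps(value) for key, value in metadata.items()}

def decode_metadata_tags(tags):
    """Recover the metadata dict from stored tag strings."""
    return {key: json.loads(value) for key, value in tags.items()}

# Example metadata as described above: value range, bounds, ...
metadata = {
    "range": [0.0, 4095.0],
    "bounds": [10.0, 55.0, 11.0, 56.0],
}
tags = encode_metadata_tags(metadata)
assert decode_metadata_tags(tags) == metadata
```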

dionhaefner added the enhancement (New feature or request) and question (Further information is requested) labels on Mar 31, 2020
@j08lue
Collaborator

j08lue commented Mar 31, 2020

I am split on this:

  • Pro ditch: The database is what makes deployment cumbersome - you need to make sure you update the DB in sync with the data and manage access e.g. to a MySQL db. Just having to push files and get their names right is easier and would move all the keys logic to the regex / pattern matching schema.
  • Pro keep: I always find file names and folder structures an ugly implementation detail that I am quite happy Terracotta hides by default from the API user.

Having the metadata in a database only makes sense if we want to search it, which we currently don't allow.

We could also decide to strengthen the DB feature by adding query functionality. But then we open the box of people wanting to query very differently.

An alternative model could be to save raster metadata on the rasters themselves

Like in GeoTIFF tags? I assume you could still retrieve the metadata reasonably fast from the tags, and quickly enough build the path → metadata mapping for subsequent filtering, don't you think? It's completely parallelizable, at least.
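That parallel tag read could be sketched like this (the reader is stubbed out here; a real one would open each GeoTIFF, e.g. with rasterio, and read its tags):

```python
from concurrent.futures import ThreadPoolExecutor

def read_tags(path):
    # Stub: a real implementation would open the GeoTIFF at `path`
    # and return the metadata stored in its tags.
    return {"source": path}

def build_metadata_map(paths, max_workers=8):
    """Read tags from many rasters concurrently to build the
    path -> metadata mapping mentioned above."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(read_tags, paths)))
```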

@dionhaefner
Collaborator Author

Well, if you do it right, the user would never know they are looking at paths.

Before:

example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png

After:

example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png

Just have the folder structure

rasters/
├─ 20180101/
│  └─ 25XEL/
│     └─ B05.tif
├─ 20180102/
└─ 20180103/

And in my proposal, you would still have the option to use a (fully searchable) database, it just wouldn't contain the raster metadata by default. So you give either TC_DB_TYPE=filesystem or TC_DB_TYPE=sqlite or so.
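A sketch of how that proposed TC_DB_TYPE switch might dispatch (the driver classes here are illustrative placeholders, not existing Terracotta API):

```python
import os

class FilesystemDriver:
    """Resolves datasets by pattern matching on the raster folder."""

class SQLiteDriver:
    """Resolves datasets through a fully searchable SQLite database."""

DRIVERS = {"filesystem": FilesystemDriver, "sqlite": SQLiteDriver}

def get_driver():
    """Pick a driver from the (proposed) TC_DB_TYPE setting,
    defaulting to plain filesystem matching."""
    db_type = os.environ.get("TC_DB_TYPE", "filesystem")
    return DRIVERS[db_type]()
```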

Like in GeoTIFF tags?

Yes.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

An alternative model could be to save raster metadata on the rasters themselves. In that case, it would be much less likely for the raster metadata to go out of date.

This is a good idea in any case.

and we could recover the current behavior by running directory listings on the raster folder.

This would mean that keys would have to match the file/folder structure directly. Wouldn't that be a step backwards? I guess it could be good as a default behavior; then a DB is only needed if the user wants arbitrary key->raster mappings.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

But the DB would still be "managed" by Terracotta, right?
As in tc would still provide the tools for setting up a DB, adding keys and so on?

@dionhaefner
Collaborator Author

Yes, so you have pattern matching on the file system by default, and the option to use a database. Database could be anything that maps your search criteria to what we call keys now (and might call "path" or just "key" in a new version).

Example I have in mind: a STAC catalogue of all your rasters, you index that via SQL-like queries, and by that obtain the key you can use to query TC for the tiles.
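The last step of that flow, turning keys obtained from an external catalogue (e.g. a STAC search) into a tile request, might look like this (helper name hypothetical):

```python
def tile_url(base, keys, z, x, y):
    """Build a singleband tile URL from resolved key values.

    The key values would come from an external catalogue query;
    this just shows the final URL-building step of the flow above.
    """
    return f"{base}/singleband/" + "/".join(keys) + f"/{z}/{x}/{y}.png"

# Mirrors the URL shape from the earlier comment:
# example.com/singleband/20180101/25XEL/B05/<z>/<x>/<y>.png
url = tile_url("example.com", ["20180101", "25XEL", "B05"], 8, 139, 72)
```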

@dionhaefner
Collaborator Author

But the DB would still be "managed" by Terracotta, right?
As in tc would still provide the tools for setting up a DB, adding keys and so on?

I'm not sure. I think it would be good to do that for the most common DB usecase(s), but we should also have the option to have databases that are externally managed.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

I'm not sure. I think it would be good to do that for the most common DB usecase(s), but we should also have the option to have databases that are externally managed.

The reason I'm asking is that if we don't have these tools, users would then have to tinker with a database if they want the same functionality as now.
I imagine that people like using Terracotta because it's really easy to set up and use; I would be careful about messing with that too much.

@dionhaefner
Collaborator Author

Yea, we could still fully support e.g. the simple SQLite DB, and make any external DB read-only.

@dionhaefner
Collaborator Author

Not sure though in what way the SQLite DB would be easier to use than the filesystem. Ingest == dump the file into an S3 bucket sounds like the dream to me.

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

Not sure though in what way the SQLite DB would be easier to use than the filesystem. Ingest == dump the file into an S3 bucket sounds like the dream to me.

Yes, it's very plausible that the people who would want the SQLite DB are a small minority, so it might be fully justified to make their lives a tiny bit harder, for the greater good.
But there could be people who don't want to / can't use S3 for some reason.

@dionhaefner
Collaborator Author

Well, the data has to live somewhere? S3 was just an example, could be any filesystem.

@dionhaefner
Collaborator Author

Aha, S3 could actually be an example where you would want to use a database. You can't run glob patterns on S3 buckets, so if you have tens of thousands of files it could become slow to filter those on /datasets queries.
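Since S3 only supports prefix listings, the glob would have to happen client-side. A sketch of that filtering step (one would obtain `object_keys` from e.g. a boto3 `list_objects_v2` paginator):

```python
import fnmatch

def filter_keys(object_keys, pattern):
    """Filter a flat list of S3 object keys against a glob-style pattern.

    Note: fnmatch's '*' also crosses '/', unlike a shell glob, so patterns
    may need to be stricter. This linear scan over every object is exactly
    what gets slow with tens of thousands of files.
    """
    return [key for key in object_keys if fnmatch.fnmatch(key, pattern)]
```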

@mrpgraae
Collaborator

mrpgraae commented Apr 1, 2020

Well, the data has to live somewhere? S3 was just an example, could be any filesystem.

Right, I'm just thinking out loud, trying to cover every angle from the user's perspective. Imagining how this would work from the perspective of the most unlikely user, who wants to have their rasters scattered all over the place 😄

But yeah, decoupling from the DB would probably make it better for the vast majority of users, at the possible cost of making it slightly worse for a small minority.

Computing the metadata would then happen during optimize-rasters I guess? That would make optimizing mandatory (unless the user wants lazy metadata computation). Perhaps optimize-rasters should then be prepare-rasters and the actual optimization can be turned off with a flag.

Aha, S3 could actually be an example where you would want to use a database. You can't run glob patterns on S3 buckets, so if you have tens of thousands of files it could become slow to filter those on /datasets queries.

Okay, that changes things a bit, I guess. Using Lambdas and S3 should not require any external DB tinkering, right?

@dionhaefner
Collaborator Author

dionhaefner commented Apr 1, 2020

Simplest mode of usage:

(only recommended for data exploration)

$ terracotta serve -r myrasters/{date}/{tile}/{band}.tif

In this case

  • /keys just gives [date, tile, band] with empty descriptions
  • /datasets runs glob with the given pattern and extracts / filters keys
  • /metadata opens the GTiff and reads its tags; if there are no tags, metadata is computed sloppily based on an overview
  • Metadata access triggers warning about missing tags and possibly missing cloud-optimization
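The glob-and-extract step for /datasets could work by turning the raster pattern into a regex with one named group per key. An illustrative sketch (not Terracotta's actual implementation):

```python
import re

def pattern_to_regex(pattern):
    """Turn a path pattern like 'myrasters/{date}/{tile}/{band}.tif'
    into a compiled regex with one named capture group per key."""
    escaped = re.escape(pattern)  # '{' and '}' come out as '\{' and '\}'
    regex = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>[^/]+)", escaped)
    return re.compile(regex + "$")

# Extract key values from a matching path:
rx = pattern_to_regex("myrasters/{date}/{tile}/{band}.tif")
match = rx.match("myrasters/20180101/25XEL/B05.tif")
assert match.groupdict() == {"date": "20180101", "tile": "25XEL", "band": "B05"}
```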

Slightly more advanced:

(recommended usage)

# This optimizes the files and dumps them into the S3 bucket when done
$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif
  • /keys reads descriptions from given JSON file
  • /datasets retrieves the file list of the entire bucket and filters it according to query parameters
  • /metadata reads from GTiff tags

Advanced:

$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ terracotta ingest s3://myrasters/{date}/{tile}/{band}.tif -o s3://myrasters/tc.sqlite
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -d s3://myrasters/tc.sqlite
  • /datasets now runs efficient queries on the SQLite database
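The ingest step could boil down to a single table with one column per key plus the raster path (mirroring today's datasets table). A sketch of the /datasets equality query against such a schema (schema and helper are illustrative):

```python
import sqlite3

KEYS = ("date", "tile", "band")  # known key names, fixed at ingest time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE datasets (date TEXT, tile TEXT, band TEXT, path TEXT)")
con.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        ("20180101", "25XEL", "B05", "s3://myrasters/20180101/25XEL/B05.tif"),
        ("20180102", "25XEL", "B04", "s3://myrasters/20180102/25XEL/B04.tif"),
    ],
)

def datasets(**filters):
    """Equality filtering on keys, as /datasets would run it."""
    assert all(key in KEYS for key in filters)  # guard against SQL injection
    where = " AND ".join(f"{key} = ?" for key in filters)
    query = "SELECT path FROM datasets" + (f" WHERE {where}" if where else "")
    return [row[0] for row in con.execute(query, tuple(filters.values()))]

assert datasets(tile="25XEL", band="B05") == ["s3://myrasters/20180101/25XEL/B05.tif"]
```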

Custom paths to rasters:

(keys are not coupled to file paths)

Same as before, but use Python API to create SQLite database

External database:

Option 1a

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external
  • /keys and /datasets are disabled (TC is just used for serving metadata and tiles)
  • User supplies an external discovery API that maps searchable parameters to Terracotta's keys (or just one ID key)
  • Discovery API could be a Terracotta plugin / related project

Option 1b

$ terracotta serve --external
  • /keys and /datasets are disabled (TC is just used for serving metadata and tiles)

  • User supplies an external discovery API that maps searchable parameters to file paths

  • All raster queries require full paths:

    $ curl example.com/metadata?path=s3://myrasters/20180101/25XEL/B05.tif

Option 2a

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456
  • Enforce some sort of database structure

  • Ingestion and creation could go through Terracotta

  • /keys returns searchable fields

  • /datasets does equality checks on searchable fields and returns keys used for raster retrieval:

    $ curl example.com/datasets?sensor=S2
    [{date: 20120101, band:B04, tile:25XEL}]  # note: no sensor here!

Option 2b

$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456
  • Enforce some sort of database structure

  • Ingestion and creation could go through Terracotta

  • /keys returns searchable fields

  • /datasets supports SQL SELECT syntax:

    $ curl example.com/datasets?where="sensor == 'S2' AND tile == '25XEL' AND cloudcover<=95"
    [{date: 20120101, band:B04, tile:25XEL}] 

@dionhaefner
Collaborator Author

Options 2x look nice, but they are a bit half-baked, since there's no real way to browse the database...

@j08lue
Collaborator

j08lue commented Apr 1, 2020

so if you have tens of thousands of files it could become slow to filter those on /datasets queries

Slower than a database lookup for sure, but listing keys in a bucket and running some regex on them is not that slow either.
