ODC EP 010 Replace Configuration Layer

ODC-EP 10 - Replace configuration layer

Overview

A ground-up, compatibility-breaking rewrite of the configuration layer.

Proposed By

Paul Haesler (@SpacemanPaul)

State

Change is merged into develop-1.9 branch, but not yet released.

Motivation

The configuration layer in datacube-1.8 is complex, inconsistent and poorly documented. Further details on the behaviour in 1.8.x can be found in Issue #1258. Any effort to fully and accurately document the existing behaviour would likely result in confusing and unreadable documentation.

1.9 (and 2.0) is a good opportunity to retire accumulated technical debt and replace the existing code with something more consistent and maintainable without being weighed down by backwards compatibility.

Configuration features/quirks in v1.8.x

One or more ODC configuration files in INI File Format, implemented using the configparser module from the Python standard library.
Configuration files can contain:

A special "user" section specifying a default environment.
A named section per environment, where each environment can specify (1) which index driver to use and (2) any connection information required for the database backend.

Ability to merge environments from multiple configuration files. (Inconsistently exposed. Available through the CLI but not directly through the Datacube() constructor.)
Default config search path and environments are defined if user supplies neither. Exact fall-back rules are convoluted.
Can inject config directly with environment variables. This behaviour is poorly documented and interacts inconsistently and/or unexpectedly with the file merging and default search path behaviour described above. Some configuration items cannot be set with environment variables at all (in particular, selecting an index driver other than the default).
$DATACUBE_CONFIG_PATH environment variable allows setting a single file location which sits at a fixed place in search path.
The configuration layer is only used for configuring the index backend. Other ODC configuration (e.g. AWS/S3/rasterio configuration, etc.) is handled separately.
The (undocumented) auto_config() function (also available through python -m datacube) writes out a config file based on the current configuration (which may have been merged from multiple files and environment variables).

Design Concerns

Single file vs multiple/merged config files

A multi-file implementation provides some desirable features for large centrally managed installations, e.g. NCI and (to a lesser extent) DEA Sandbox. However, it can lead to confusion about where the current configuration is actually coming from, and makes the interaction between configuration from files and from environment variables unnecessarily complex.

Given that the confusing and complex nature of the 1.8.x implementation is a driving force behind this EP, a single file solution has been chosen. Large centrally managed installations should advise users to make a copy of the default configuration file and modify it, rather than creating a new configuration file that is read in conjunction with the default file.

A single file implementation also greatly reduces the usefulness of the (undocumented) auto_config() function.

Config file format

The Windows INI style config format used in 1.8 only supports a single layer of hierarchy, which places limits on what other (i.e. non-index-layer-specific) configuration can be added to the configuration layer.

Given the heavy use of YAML in other parts of the ODC codebase, a switch to a YAML-based configuration file format is worth considering.

Advantages of a switch to YAML include:

a) Can package config in a string without \n newlines everywhere.
b) Arbitrary-depth nested hierarchies.

Nested hierarchy is not needed for simply configuring index connections, which is all config is used for in 1.8.x. But in 1.8 we only have one global config for cloud access (e.g. AWS/S3) settings. It is not unreasonable to want to be able to store data which requires different AWS/S3 settings in the same index. STAC currently supports this, and we will need to support it to enable tighter STAC/ODC integration. Allowing per-index AWS environment settings would be an improvement. STAC stores these per-"dataset", equivalent to storing them with the data URI/location in ODC, but some sort of per-provider/bucket configuration option seems preferable. This would be extremely unwieldy to implement in an INI-based deployment.

Both INI and (non-nested) YAML will be supported in 1.9. INI format may be deprecated in future when features are added that require deeper config nesting. Support for the INI format may subsequently be dropped altogether.

N.B. Config file examples use a mixture of ini and yaml formats.

Interaction Between Environment Variables and Config Files.

Configuration via environment variables is essential in e.g. cloud-deployed environments where leaking of credentials is a serious risk, and is therefore a required feature.

The interaction between config files and environment variables in datacube-1.8 is quite complex and unexpected. E.g. environment variables are not used at all if a config file is explicitly specified, but are merged on top of the default config files.

It is important to consider that we now need to allow for multiple indexes to be in use at once. (In datacube-1.8, database credentials passed in via environment variables are applied to all environments, effectively restricting access to a single database.)

Proposal

Single config file (no merging).
YAML and INI formats supported initially.
auto_config() function dropped.

A. Contents of configuration

A config file consists of environments. An environment may be configured independently, or can be defined as an alias to another existing environment.

The "user" section no longer has a special meaning (as it is no longer relevant when config files are not merged.)

; Comments in INI format start with a semicolon.
[default]
   alias: prod

[prod]
   db_hostname: prod.dbs.example.net
   db_database: odc_prod
   db_username: cube
   db_password: secret_squirrel
   db_connection_timeout: 60
   
[dev]
   index_driver: postgis
   db_hostname: dev.dbs.example.net
   db_database: odc_dev

   db_username: cube
   db_port: 5432
   db_iam_authentication: y
   db_iam_timeout: 300

[temp]
   index_driver: memory

Restrictions on environment names:

Can only contain alphanumeric characters. In particular, must not contain underscores or dashes.
First character in name must be alphabetic.
All alphabetic characters must be all lower case.
i.e. must match regex: ^[a-z][a-z0-9]*$

Restrictions on configuration fields:

Can only contain alphanumeric characters and underscores
First character in name must be alphabetic
All alphabetic characters must be all lower case
i.e. must match regex: ^[a-z][a-z0-9_]*$

(The restrictions are to support a systematic, consistent and reversible mapping between config options and environment variable names. See Section B.4 below.)
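The naming rules and the reversible mapping they enable can be sketched as follows. This is a minimal illustration, not the datacube implementation: the helper names to_env_var() and from_env_var() are hypothetical, and only the two regular expressions and the ODC_[environment]_[field] pattern come from this proposal.

```python
import re

# Patterns copied from the restrictions listed above.
ENV_NAME_RE = re.compile(r"^[a-z][a-z0-9]*$")
FIELD_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def to_env_var(env_name: str, field_name: str) -> str:
    """Hypothetical helper: build the ODC_<ENV>_<FIELD> variable name."""
    if not ENV_NAME_RE.match(env_name):
        raise ValueError(f"invalid environment name: {env_name!r}")
    if not FIELD_NAME_RE.match(field_name):
        raise ValueError(f"invalid field name: {field_name!r}")
    return f"ODC_{env_name.upper()}_{field_name.upper()}"


def from_env_var(var_name: str) -> tuple:
    """Hypothetical helper: recover (environment, field) from a variable name.

    Environment names cannot contain underscores, so the first underscore
    after the "ODC_" prefix always separates the environment from the field.
    """
    prefix = "ODC_"
    if not var_name.startswith(prefix):
        raise ValueError(f"not an ODC variable: {var_name!r}")
    env_part, separator, field_part = var_name[len(prefix):].partition("_")
    if not separator or not field_part:
        raise ValueError(f"no field name in: {var_name!r}")
    return env_part.lower(), field_part.lower()
```

If underscores were allowed in environment names, a name such as ODC_MY_ENV_DB_PORT could not be split unambiguously, which is presumably why they are excluded. A real implementation would also need to treat reserved names such as ODC_CONFIG_PATH specially.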

Configuring database details as a single database url (instead of separate hostname, port, database, username and password).

Some index drivers (initially the postgres and postgis index drivers) will support supplying connection details as a single connection url. If a url is provided, it overrides any individual connection fields (db_hostname, db_port, db_database, db_username and db_password) provided for that environment. The format of the database url will depend on the index driver, but for both postgres and postgis drivers will be:

postgresql://[username]:[password]@[hostname]:[port]/[database]

Or for passwordless access to a database on localhost:

postgresql:///[database]

E.g.

# YAML comments start with a hash/octothorpe symbol.
myenv:
    index_driver: postgis
    db_url: postgresql://user:insecure_password@hostname.domain:5432/mydb
    db_database: will_be_overridden
    db_password: this_is_not_used_either

is equivalent to

# Comments never affect behaviour
myenv:
    index_driver: postgis
    db_hostname: hostname.domain
    db_database: mydb
    db_username: user
    db_port: 5432
    db_password: insecure_password

The db_url can also be supplied in a generic environment variable (see Section B.4 below).

If db_url is supplied, the separate url components (e.g. db_username, db_database, etc) are exposed through the config interface as if they had been supplied separately. Note that the reverse is not true. Extracting the URL from a config environment that was configured through db_* components requires a separate function (datacube.cfg.psql_url_from_config()).
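As a rough illustration of how a db_url maps onto the individual db_* fields, the sketch below uses only the Python standard library. The function name split_db_url() is hypothetical and is not part of the datacube API. It assumes the postgresql:// format shown above and does not percent-decode credentials.

```python
from urllib.parse import urlsplit


def split_db_url(db_url: str) -> dict:
    """Hypothetical helper: split a postgresql:// URL into db_* fields.

    Components that are absent from the URL are omitted from the result.
    Percent-encoded characters in credentials are NOT decoded here.
    """
    parts = urlsplit(db_url)
    if parts.scheme != "postgresql":
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    candidates = {
        "db_username": parts.username,
        "db_password": parts.password,
        "db_hostname": parts.hostname,
        "db_port": parts.port,  # int, or None when no port is given
        "db_database": parts.path.lstrip("/") or None,
    }
    return {key: value for key, value in candidates.items() if value is not None}


fields = split_db_url(
    "postgresql://user:insecure_password@hostname.domain:5432/mydb"
)
local_fields = split_db_url("postgresql:///mydb")
```

Here fields holds all five db_* entries from the example above, while local_fields contains only db_database, matching the passwordless localhost form.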

Possible future deprecation: deprecate (and then later remove) the db_* config entries for the postgres and postgis drivers in favour of the single url approach.

B. Config loading/reading process

1. Bypassing all configuration files (explicit config text OR dictionary)

Configuration file text may be supplied directly, without an actual on-disk config file. If configuration is supplied using these methods, no further config processing is performed, i.e. steps 2-4 below are skipped.

In Python: dc = Datacube(raw_config="[default]\ndb_hostname....")
Via CLI: datacube --raw-config "`config_file_generator --option blah`" (-R or --raw-config NEW)
Via Environment variable: ODC_CONFIG="`config_file_generator --option blah`"

CLI option or Datacube argument overrides environment variable $ODC_CONFIG. If none of the above are provided, on-disk files and/or environment variables are read, as per the steps described below.

Additionally, for Python access only, a configuration dictionary may be passed in (not serialised into a text string). This is treated as equivalent to supplying config text and no further config processing is performed.

dc = Datacube(raw_config={
   "default": {
       "db_hostname": "localhost", "db_port": 5432, ...
   }
})
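To show how raw config text and a raw config dictionary relate, here is a small sketch that turns INI-format text into the nested-dictionary shape used above. It relies on the standard-library configparser module. The helper name ini_text_to_dict() is hypothetical, and how datacube 1.9 parses text internally is not specified in this proposal.

```python
import configparser


def ini_text_to_dict(text: str) -> dict:
    """Hypothetical helper: parse INI-format config text into nested dicts.

    All values are returned as strings, because INI has no native types.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # sections() lists only explicitly declared sections such as [default];
    # configparser's implicit upper-case DEFAULT section is not included.
    return {name: dict(parser[name]) for name in parser.sections()}


config = ini_text_to_dict("[default]\ndb_hostname: localhost\ndb_port: 5432\n")
```

The result has the same structure as the dictionary form, except that db_port is the string "5432" rather than an integer.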

2. File Finder

If explicit config text was not provided, we need to find a config file in the file system.

This design is a one-file-only design.

2a. Explicit file locations

Either as a single path:

In Python: dc = Datacube(config="/path/to/configfile")
Via CLI: datacube -C /path/to/configfile (-C or --config)
Via Environment Variable: ODC_CONFIG_FILE=/path/to/configfile
Via Legacy Environment Variable: DATACUBE_CONFIG_PATH (with deprecation and behaviour change warning)

Or a priority list of paths:

In Python: dc = Datacube(config=['/path/to/override_config', '/path/to/default_config']) NEW
Via CLI: datacube -C /path/to/override_config -C /path/to/default_config
Via Environment Variable (like a UNIX PATH): ODC_CONFIG_PATH=/path/to/override_config:/path/to/default_config NEW
Via Legacy Environment Variable (like a UNIX PATH): DATACUBE_CONFIG_PATH (with deprecation and behaviour change warning) NEW (but still deprecated)

The possible locations are searched in the order provided and the first to exist in the file system is used. No merging is performed.

If config locations are provided and none of the files exist, an error is raised.

2b. Default file locations.

If no config file locations are provided, the following default priority path list is used. (The first in the list found is used, again no merging is performed.)

datacube.conf (i.e. in the current working directory).
~/.datacube.conf (i.e. in the current user's home directory)
/etc/default/datacube.conf NEW
/etc/datacube.conf

If no config file locations are provided, and none of the above exist, a minimal default config (datacube.cfg.cfg._DEFAULT_CONFIG) is used.
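The file finder rules in sections 2a and 2b can be sketched as below. This is an illustrative outline only: find_config_file() and DEFAULT_PATHS are hypothetical names, and returning None to signal "use the built-in minimal default config" is an assumption made for this example.

```python
from pathlib import Path
from typing import Iterable, Optional

# Default search list from section 2b, highest priority first.
DEFAULT_PATHS = [
    "datacube.conf",
    "~/.datacube.conf",
    "/etc/default/datacube.conf",
    "/etc/datacube.conf",
]


def find_config_file(paths: Optional[Iterable[str]] = None) -> Optional[Path]:
    """Hypothetical helper: return the first existing config file.

    Returns None only when no paths were supplied and none of the default
    files exist, meaning the minimal built-in default config applies.
    """
    explicit = paths is not None
    candidates = list(paths) if explicit else DEFAULT_PATHS
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path  # First match wins; files are never merged.
    if explicit:
        # Explicit locations were supplied but none of them exist.
        raise FileNotFoundError(f"no config file found in: {candidates}")
    return None
```

The only difference between the explicit and default cases is what happens when nothing is found: explicit locations raise an error, while the default list falls through to the built-in configuration.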

3. Choosing which environment to use.

3a. Explicitly provided environment

The user may explicitly specify an environment:

In Python: dc = Datacube(env="dev")
Via CLI: datacube -E dev (--env or -E)
Via Environment Variable: ODC_ENVIRONMENT=dev
Via Legacy Environment Variable: DATACUBE_ENVIRONMENT (with deprecation warning)

Environment variables are only read if the environment is not explicitly passed in via Python or the CLI.

Note that the env argument to Datacube() can take an explicit ODCEnvironment object instead of a string.

3b. Default behaviour when no environment is explicitly specified.

The default environment is "default".
If there is no environment (or environment alias) called "default", then the "datacube" environment is used if it exists (with a deprecation warning.)
If neither the "default" nor the "datacube" environment exists (and no environment is explicitly specified), a second attempt is made to use the "default" environment. This allows connection parameters to be specified purely with legacy $DB_* environment variables, with no actual configuration file and without having to explicitly supply the environment name.
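The selection rules in 3a and 3b can be summarised in a short sketch. The function choose_environment() is hypothetical and not part of the datacube API. It assumes that ODC_ENVIRONMENT is checked before the legacy DATACUBE_ENVIRONMENT, which this proposal does not state explicitly.

```python
import os
import warnings
from typing import Iterable, Mapping, Optional


def choose_environment(
    explicit: Optional[str],
    defined: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: pick the environment name to use.

    `defined` holds the environment names (including aliases) that are
    present in the active config.
    """
    if explicit:
        return explicit  # Python argument or CLI option always wins.
    if environ.get("ODC_ENVIRONMENT"):
        return environ["ODC_ENVIRONMENT"]
    if environ.get("DATACUBE_ENVIRONMENT"):  # Assumed lower priority.
        warnings.warn(
            "DATACUBE_ENVIRONMENT is deprecated; use ODC_ENVIRONMENT",
            DeprecationWarning,
        )
        return environ["DATACUBE_ENVIRONMENT"]
    names = set(defined)
    if "default" in names:
        return "default"
    if "datacube" in names:
        warnings.warn(
            'Falling back to the "datacube" environment is deprecated',
            DeprecationWarning,
        )
        return "datacube"
    # Neither exists: use "default" anyway so that it can be defined
    # purely through environment variables.
    return "default"
```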

The "default_environment" setting in the "user" section of the config file is no longer supported because it doesn't make sense in the absence of file merging (and it makes the contents of the config file simpler and more consistent). If the config file contains this entry a warning is issued.

4. Config via Generic Config Environment Variables

Any configuration field not in the active config file can be supplied by (or any field in the active config file overridden by) a generic config environment variable named:

$ODC_[environment_name_or_alias]_[field_name]

Both names/aliases are converted to upper case for the environment variable name.

E.g. Given the following contents of the active config file:

[default]
   alias: prod

[prod]
   db_hostname: prod.dbs.example.net
   db_database: odc_prod
   db_username: odc
   db_password: insecure_passwd1

[dev]
   db_hostname: dev.dbs.example.net
   db_database: odc_dev
   db_username: odc

[temp]
   index_driver: memory

AND the following environment variable values:

# This could be specified as ODC_DEFAULT_DB_PASSWORD or ODC_PROD_DB_PASSWORD
# If both are supplied the non-alias one (ODC_PROD_DB_PASSWORD) takes precedence.
ODC_DEFAULT_DB_PASSWORD=secret_and_secure
ODC_PROD_DB_HOSTNAME=production.dbs.internal

ODC_DEV_DB_IAM_AUTHENTICATION=y
ODC_DEV_DB_IAM_TIMEOUT=3600

ODC_DYNENV_DB_HOSTNAME=another.dbs.example.com
ODC_DYNENV_DB_USERNAME=odc
ODC_DYNENV_DB_PASSWORD=secure_and_secret
ODC_DYNENV_DB_DATABASE=other

Then the effective value of the configuration is:

[default]
   alias: prod

[prod]
   db_hostname: production.dbs.internal
   db_database: odc_prod
   db_username: odc
   db_password: secret_and_secure

[dev]
   db_hostname: dev.dbs.example.net
   db_database: odc_dev
   db_username: odc
   db_iam_authentication: y
   db_iam_timeout: 3600

[temp]
   index_driver: memory

[dynenv]
   db_hostname: another.dbs.example.com
   db_username: odc
   db_password: secure_and_secret
   db_database: other

Notes:

Operationally the config layer will only know about the dynenv environment if the user explicitly requests it.
Although new environments can be defined dynamically with environment variables, creating or overriding aliases with environment variables will be forbidden as it creates too many implementation-specific corner-cases in behaviour.
The legacy $DB_DATABASE, $DB_HOSTNAME, $DB_PASSWORD, etc. environment variables will be applied to ALL environments, with a deprecation warning.
The database url (as discussed above) can be passed in by environment variable: ODC_MYENV_DB_URL=postgresql://user:insecure_password@hostname.domain:5432/mydb
The legacy $DATACUBE_DB_URL environment variable will be applied to ALL environments, with a deprecation warning.
If a config entry for an environment is overridden by multiple environment variables, one named using the canonical environment name and others named using an environment alias, then the environment variable using the canonical name is used.
If environment variables are present for multiple aliases of an environment, but not for its canonical name, then only one of the matching environment variables will be used. Which one is chosen is arbitrary and may change between releases.
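The override rules above can be sketched for a single field as follows. The function resolve_field() is hypothetical and is not the datacube implementation. It ignores the legacy $DB_* variables, and the order in which aliases are tried is simply the order of the list passed in, standing in for the "arbitrary" choice described in the notes.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_field(
    field: str,
    canonical_env: str,
    aliases: Sequence[str],
    file_values: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: effective value of one field in one environment.

    Lookup order: the variable named after the canonical environment,
    then variables named after its aliases, then the config file value.
    """
    for env_name in [canonical_env, *aliases]:
        var_name = f"ODC_{env_name.upper()}_{field.upper()}"
        if var_name in environ:
            return environ[var_name]
    return file_values.get(field)


# Values taken from the example config file and variables above.
prod_file = {
    "db_hostname": "prod.dbs.example.net",
    "db_database": "odc_prod",
    "db_username": "odc",
    "db_password": "insecure_passwd1",
}
example_environ = {
    "ODC_DEFAULT_DB_PASSWORD": "secret_and_secure",
    "ODC_PROD_DB_HOSTNAME": "production.dbs.internal",
}
password = resolve_field("db_password", "prod", ["default"], prod_file, example_environ)
hostname = resolve_field("db_hostname", "prod", ["default"], prod_file, example_environ)
database = resolve_field("db_database", "prod", ["default"], prod_file, example_environ)
```

With these inputs the password comes from the alias-named variable, the hostname from the canonical-named variable, and the database name from the file, matching the effective configuration listed earlier.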

Consistency

All entry points will use a consistent API for resolving configuration information. This API will be exposed and documented for reuse (e.g. to allow determining database connection details from non-core code).

A brief overview of the API:

# Reading in configuration

cfg = ODCConfig()                                      # Use default/environment variable-defined config.
cfg = ODCConfig(text="'default':{'db_hostname':...")   # Use provided text as config file.
cfg = ODCConfig(raw_dict={                           # Use provided dictionary as config.
     "default": {
          "db_hostname": ... 
     }
})
cfg = ODCConfig(paths="/path/to/file")                  # Read from a file system path
cfg = ODCConfig(paths=[                                 # Read from first file found from a list of file system paths
     "/path/to/file",
     "path/to/another/file"
])

# Accessing configuration

url_for_env = cfg["dev"].db_url
url_for_default_env = cfg[None].db_url

# Initialise Datacube object from an ODCEnvironment:

dc = Datacube(env=cfg["dev"])

Feedback

Damien Ayers (2023-04-21)

Paul and I have discussed this EP prior to its drafting. Given the complexity and limitations of the current configuration system, my feeling is that we should scrap the implementation of the current system, and clearly define a simpler system before implementing it.

On the table for discussion:

Should we look for configuration files in multiple places? I think that this is worth having, so yes.
For a simple system, I think we're much better off with INI style than YAML.

Specific points:

I think we should ditch the multiple file overlay system. It's too hard to reason about.
Requirements: we must allow configuration via Environment Variables as well as via a file.

Paul's responses to Damien's comments above

All accepted and merged, except INI vs YAML - I still find the arguments I give above compelling, particularly re: better STAC interoperability.

Matt Paget (2023-04-28)

Looks great! Some comments:

The environment variables could potentially get messy with lots of variants for different ODC environments. For a system/deployment admin, the new /etc/default/datacube.conf could be more suitable (e.g., a file managed by puppet etc).
- I might suggest that the docs could present the ODC_DEFAULT_* env vars as an available fallback (as noted above). Then mention that other ODC_[environment]_* env vars can be used too but with a note of caution that /etc/default/datacube.conf might be more suitable for administrators.
It would be helpful to expose the datacube config reconciling function(s) so that the resulting (db) values can be used by ODC repos and custom code. Perhaps the "API" aspect of the config reconciling could be described above as well?

Paul's responses to Matt's comments above

/etc/default/datacube.conf only makes sense for managed environments. We want to support that, but where a user wants to create and manage their own config, we want to give them that flexibility without them having to constantly think about the system-wide config as well as their own. But yes, happy for the documentation to recommend particular approaches for the contexts they are best suited for.
Extracting db values from config (in a way consistent with ODC's behaviour) is a use case I hadn't thought about - I'll keep that in mind.

Voting

Enhancement Proposal Team

Paul Haesler (@SpacemanPaul)

Links
