Skip to content

schema management tool for PostgreSQL, using pgcmp to find discrepancies

Notifications You must be signed in to change notification settings

cbbrowne/mahout

Repository files navigation

Mahout Schema Manager

What is Mahout? Mahout is a schema manager intended to ease the process of automating the deployment of database schema changes for PostgreSQL.

Deploying schema changes may be very simple if you are merely managing a “personal” database where it is reasonable to deploy changes in-place.

On the other hand, when deploying database-based applications into sophisticated production environments, it is crucial to be able to consistently duplicate the process of deploying schemas and upgrades in development, test, and production environments. That is what Mahout tries to make a bit easier.

Other Schema Managers

There are a number of toolsets out there that do notionally similar things.

Sqitch
Is a Perl-based system that supports building schema upgrade scripts for PostgreSQL databases.
Liquibase
Supports a number of databases, implemented in Java.
Flyway
Supports a number of databases, implemented in Java.

In contrast with them, Mahout

  • Only supports PostgreSQL; it makes no attempt to support other databases.
  • Prefers using SQL scripts as the representation of schemas (Liquibase seems to prefer representing database schema in an XML schema).
  • Has a very low “deployment footprint”; its prerequisites are minor:
    psql
    Which you already had available if you are using PostgreSQL
    bash
    Which you almost certainly already have around
    pgcmp
    This is a schema comparison tool that mainly uses psql and bash
    tsort
    Part of GNU Core Utilities, used to work out dependency maps
  • Audits changes extensively; Mahout uses pgcmp to ensure that if you are attempting to deploy a change into production, and someone has been messing around with database schemas, you will be made aware of this before you get part of the way through deploying a set of schema changes and discover them breaking.
  • Starts by being paranoid. It assumes that it is necessary to examine production schemas to ensure they are unmodified. If it discovers modifications, they are treated as defects which must be rectified before proceeding. This prevents falling into the problem that a hacked-up production schema is incompatible with the upgrade scripts.
  • Eschews the (apparently from Ruby on Rails) dogma of requiring forwards-and-backwards migrations. The assumption made by Mahout is that your system administrators will arrange for backup to enable rolling back should severe problems be experienced. Commonly, these days, filesystem snapshots tend to be a good way of keeping backups in case of database disaster.

Derivation of the name Mahout

Mahout is the Hindi word for “elephant keeper;” since the PostgreSQL project uses an elephant as its visual mascot, it seems appropriate to use the term “mahout” to describe a tool that helps keep database schemas under control.

Also considered was Pahan, which is the equivalent word in Tamil. There appears to be some record of “mahout” being used as a perjorative involving cab drivers on Long Island near New York City, but that seems sufficiently separate and distant as to be not relevant.

Use Cases

There will be somewhat distinct usage patterns in 4 places where we’d expect the tool to be used:

Developers
Setting up schema to be applied
mahout init
To set up a schema to be managed using mahout, most likely in an SCM repository.
  • Later changes are made by adding SQL DDL scripts and control file entries to indicate how the DDL scripts are assembled to indicate how to upgrade from version to version.
  • If automated build management tools like QuickBuild, Jenkins, Maven are used to manage the deployment of builds, this may be all that developers do themselves.
mahout install, mahout upgrade
When developers prepare test environments for unit tests, they will likely use these Mahout commands too.
Build Process
Generation of build is done by automated processes
mahout capture
This generates authoritative logged information about by-version Schema information. Building everything “from scratch” would involve the following commands
$ mahout install Base   # Installs empty Base schema
$ mahout capture        # Captures data for all versions after that
mahout build
This provides a nice shortcut to generate an archive file comprising the bundle of schema information that needs to be carried downstream.
QA
Applying schema changes to sample production schemas
mahout check
This checks the schema in the database out to see if it matches the version it claims to be on, and verifies that Mahout data has not been tampered with.
mahout attach
If starting with a database that has not yet been mahoutized.
mahout diff
Analyzes the current database schema for differences against the version which it claims to represent.
mahout upgrade
Runs, in order, the needful DDL scripts to upgrade the schema to the latest version.
mahout slony-attach
Attaching mahout to an existing replication cluster
mahout slonik
Using Slony-I to apply the schema upgrade to a replication cluster
Production Control
Deploying changes to production
mahout attach
Only needed if attaching Mahout to a previously deployed database that predates Mahout.
mahout check
Verify that the production database is in good condition to be upgraded.
mahout diff
Analyzes the production database schema for differences against the development schema.
mahout history
List mahout activities against the production database
mahout upgrade
Runs, in order, the needful DDL scripts to upgrade the schema to the latest version.
mahout slony-attach
Akin to mahout attach, this attaches Mahout to a database replicated using Slony-I, verifying that all nodes have schemas matching the specified schema version.
mahout slonik
Prepares and runs scripts for use with Slony-I to upgrade a replicated database cluster.

Mahout subcommands

mahout init directoryname
Checks that it is safe to do so (e.g. - no pre-existing config files)
  • creates the directory afresh
  • Initializes a Mahout project
  • Usually to be done by developers
  • Establishes mahout.conf containing URI based on user environment
  • Creates directory base as the starting point
  • Runs pg_dump to write current schema to Base
  • Runs pgcmp-dump to capture the state of Base as .mahout-data/Base.pgcmp
mahout validate_control
Checks the hygiene of the control file
  • make sure that it is well formed, with valid commands
  • make sure that scripts referenced are all present
  • make sure that requires all reference versions that are listed
  • look for version ordering loops using tsort
mahout capture
This step captures .pgcmp files to prepare builds. It does an install of a specified version of the schema, and captures the schema into .pgcmp.version.
  • It also captures checksums (via md5sum) of the referenced scripts so administrators may be certain that the scripts remain unaltered.
  • It captures .pgcmp files for all versions that have not already been captured
  • It captures lists of locked objects for each ddl script, thus:
    schemaobject nameLock Acquired
    publict1AccessExclusiveLock
    publict3AccessShareLock
    publict3ShareLock

    This information will be useful to an administrator in determining how invasive an upgrade will be to a live system. Locks on tables newly introduced in a particular version will obviously be irrelevant, as a live system can’t be referring to those tables until after the upgrade is complete. But watch out for files with the filename suffix .locks-acquired

  • This action is likely to be used mostly by developers, perhaps as integrated with build automation
  • in a development environment (where NODETYPE=dev), pgcmp dumps are automatically collected on demand; in a production environment (NODETYPE=prod), pgcmp dumps are expected to already exist
mahout install
This is used to install a Mahout-managed schema in a fresh database.
  • This is how one would set up a fresh schema in a QA/production environment.
  • in a development environment (where NODETYPE=dev), pgcmp dumps are automatically collected on demand; in a production environment (NODETYPE=prod), pgcmp dumps are expected to already exist
  • An optional version number allows stopping early; if you request mahout install Base, this will install just the Base version.
mahout attach
This is used to indicate that a particular version should be associated with the present schema
  • This is the action to take to attach Mahout to an existing schema in production
  • The user specifies the version of the schema that they believe is in place
  • Mahout checks that the schema matches the schema against the specified version
  • If all matches, then Mahout would attach its own metadata to enable future tracking
  • It captures expected differences for pgcmp
mahout check
Asks what version is in the schema indicated by mahout.conf, and rummages around and checks the schema against that version using pgcmp
  • Also verifies that upgrade scripts have not been altered by verifying MD5 checksums
mahout upgrade
Performs all upgrades needed to upgrade from the present version (see mahout check) to the last available, or to a specified version.
  • A dry-run option should list its plans in detail, providing a way that DBAs can know the exact intended application ordering of the pieces to assist them in building slonik scripts
mahout diff
Finds any differences between the current version and what is expected to be in that version (e.g. - use pgcmp to compare current schema with a captured schema)
mahout history
List mahout upgrade activities performed against the local database
mahout versions
Walk configuration via tsort to get all versions, and check their application status in the database
mahout changes
This has two perspectives:
  • List the mahout scripts that are to be applied so that DBAs can figure out what they need to prepare for Slony application of the schema
  • List the major objects that change between versions so that we provide useful documentation to downstream users. They can know such things as
    • What tables are being added/removed
    • What tables are being altered
  • This has not yet been implemented; perhaps it is irrelevant in that these changes are reflected automatically by the DDL scripts referenced by the control script.
mahout build version
This takes the contents of the current Mahout directory, and generates an archive containing all of the data.
  • It is essentially a way of avoiding the need to run tar cfvz schema-version.tar.gz
mahout slonik
This generates slonik scripts for use with Slony to perform the specified upgrade.
  • It only works if the version upgrade(s) are either only consist of ddl, ddl-autocommit, and psqltest requests, or only of dml and unix requests; it will fail if there are combinations of DDL and DML together
  • It generates a slonik script with the following:
    • a pre-amble reference to allow set IDs and connection paths to be customized by an administrator
    • a SET DROP TABLE for each table removed
      • The slonik commands are only “full formed” if table IDs are available, hence, if a table is added in one version, say V1, and removed in a later version, say V2, the slonik file generated will be unaware of the table ID, and thus will instead indicate that the ID is unknown. This may be resolved by running mahout slonik again after applying version V1, at which point the slonik for version V2 will be regenerated with valid table IDs.
    • a SET DROP SEQUENCE for each sequence removed
      • As above with SET DROP TABLE, the sequence IDs of sequences that are to be dropped are unknown before the sequence has been added, so if several versions are to be applied, it may be necessary to re-run mahout slonik in order to fill in the sequence IDs for sequences being dropped.
    • an EXECUTE SCRIPT request for each psql script, to apply DDL to the cluster
    • a CREATE SET request, for the new tables and sequences that are added
    • a SET ADD TABLE for each added table
    • a SET ADD SEQUENCE for each added sequence
    • a series of SUBSCRIBE SET requests based on those already existing so that all possible nodes will receive subscriptions to the new tables
    • a MERGE SET request so that the replication set with the newly replicated tables is merged in with the main replication set
  • An alternative approach is to drop replication and recreate with the whole set of tables. But that requires little input, so seems sensible to leave to the user…
  • The slonik script is not invoked; it may be readily invoked after running mahout slonik by the command:
$ slonik .mahout-temp/mahout-ddl-script-1.1.slonik

(where “1.1” might be replaced with a more apropos version label). The slonik script contains relative references to files in .mahout-temp, as well as the DDL scripts in the Mahout installation so should be run with $PWD being set to the mahout directory containing mahout.conf and mahout.control.

Mahout Directory Structure

A schema repository will consist of a directory structure where the top level will have 2 “control” files, and then a series of directories, each indicating schema code to deployed.

No further structure is provided, however it should be obvious that setting up a directory structure to group schema scripts together by version would be wise.

Configuration control file: mahout.conf
This file contains metadata about the databases to be managed.

Alternatively, MAHOUTCONFIG may be set to indicate an alternative location to find the configuration expected in mahout.conf.

Schema control: mahout.control
This file indicates the order in which directories/files should be processed when applying schema changes.
  • If no mahout.control file is provided, at a given directory level, that will be treated as an error that would cause mahout schema application work to fail.

Contents of mahout.conf

  • URIs to indicate how to access databases of interest
    MAINDATABASE
    Contains the URI to be used for processing of schema files.
    SUPERUSERACCESS
    Contains the URI to be used for anything requiring superuser access
    COMPARISONDATABASE
    This indicates the URI of a database suitable for doing comparisons using pgcmp
    MAHOUTOMITSCHEMAS
    This indicates a set of schemas that are to be ignored when pgcmp is being used to do comparisons.
    • It is formatted as an SQL VALUES clause, so should look like (‘pg_catalog’),(‘information_schema’),(‘MaHoutSchema’)
    • Note that mahout.conf is handled as a shell script, and, if using bash, quotes will be needed otherwise MAHOUTOMITSCHEMAS will interpret the structure as an array
    PGCMPHOME
    Indicates the directory where the pgcmp comparison tool is installed
    MAHOUTSCHEMA
    Indicates the schema in which to put Mahout version and logging information
    MAHOUTSEARCHPATH
    Indicates a value to set as the default search_path at the start of each DDL/DML step. This is particularly needed when using Slony-I, where the slony user often has no default search path, and, more pointedly, if one DDL/DML script for a version explicitly sets search_path.
    • If run via mahout upgrade, the user’s default search path would automatically be attached in each step;
    • If running via mahout slonik, the slony user’s default search path (often none) would be attached to the first DDL/DML step; subsequent DDL/DML steps continue, implicitly receiving whatever search_path was in effect at the end of the previous step.
    • A MAHOUTSEARCHPATH value will be attached at the start of each DDL/DML step.
    MAHOUTOMITPG
    If true, then all use of pgcmp will be omitted, as well as database activities in the mahout build step. The purpose of this option is to support the case where you cannot have a PostgreSQL database in the build environment
    MAHOUTNODETYPE
    indicates if this is a development environment (where pgcmp files are automatically generated), or production (or QA) where pgcmp files are to be used (and not generated)
    dev
    indicates that this is a development environment, where pgcmp output is automatically generated by mahout capture, mahout install
    qa or prod
    indicates that this is an environment where pgcmp output is expected to be found in the build
  • Data files
    • In some cases, external data will be needed, e.g. - some upgrades required a data file indicating configuration of DML changes.

Note that if MAHOUTCONFIG is set, then the file at the specified location will be used instead of the mahout.conf file in the Mahout archive.

COMPARISONDATABASE=postgresql://postgres@localhost:7099/comparisondb
MAHOUTOMITSCHEMAS="('pg_catalog'),('information_schema'),('MaHoutSchema')"
MAHOUTSCHEMA=MaHoutSchema
MAINDATABASE=postgresql://postgres@localhost:7099/devdb
PGCMPHOME=/home/cbbrowne/PostgreSQL/pgcmp
SUPERUSERACCESS=postgresql://postgres@localhost:7099/postgres
CONFIG_26=/tmp/redwood-upgrade-2.6.conf
CONFIG_23y=/tmp/redwood-2.3y-currencies.txt
CONFIG_2226=/tmp/registrar-guid-map-2.2.26.csv
CONFIG_22=/tmp/registrar-guid-map-2.2.csv
MAHOUTSEARCHPATH=public,registry

Contents of mahout.control

The mahout.control file contains information indicating which changes are associated with each version of the database schema. It does not contain any database configuration, as that will vary between development, QA, and production environments; that is controlled in the mahout.conf file described previously. Instead, it contains sections indicating schema information and how it ties to versions of that schema.

Version dependency data
The version and requires directives identify, for each version, the label for that schema version, and what schema version is the prerequisite that must be installed first.

The Base version is special, being the initial “Base” version that has no prerequisite.

Schema application directives
For each file processed, there is a line to indicate how it is to be processed. These four processing types are mutually exclusive for any given version; you may have as many of each in a version upgrade as desired, but to switch to a different behaviour, you need another version.
notes
Indicates a file containing Markdown input that are to be transformed into HTML as notes about the version.
ddl
Indicating that the file contains SQL DDL and should be processed using psql against MAINDATABASE. It is an error to perform DML in a ddl script, and mahout may check for this and report an error if data other than schema is modified.
dml
Indicating that the file contains SQL DML and should be processed using psql against MAINDATABASE. It is an error to perform DDL in a dml script, and mahout may check for this and report an error if schema is modified.
unix {parameter}
Indicating that the file should be run as a UNIX command, with the values in mahout.conf loaded into the environment, and with data for {parameter} passed as the singular argument to the command. It is strongly urged that unix only be used to handle DML changes, that is, to change the data inside tables, as opposed to doing DML (table structure alterations). unix commands are not amenable to capturing .locks-acquired data.

The parameter value indicates a single parameter to be captured in the mahout.conf file and passed to the command to indicate its configuration. If complex configuration is required, then the single parameter may point to a configuration file containing as much configuration as necessary.

Testing directives
SQL scripts may be used to perform tests to verify that the schema satisfies computable requirements.
common tests
This defines a single section of the mahout.conf file to group together tests that are to be commonly applied across all versions of the schema (perhaps with per-version restrictions for particular tests). That way, validations intended to apply organizational policies are automatically applied to every new version deployed, by default.

Alternatively, tests may be attached to a particular schema version.

If a particular test is only applicable to one version of the schema, then it should be associated with that version. If a test represents common policy, to be run against several versions, then it should be associated via common tests, so that it only needs to be referenced once.

psqltest
This indicates a test that is to be run against every version of the schema.
psqltest from [Version]
This indicates a test to be run against every version of the schema beginning with a particular version. In effect, that means that there is a new requirement that will not be applied against “legacy” schema versions.
psqltest to [Version]
This indicates a test that is to be run against versions of the schema beginning with Base, and ending with the specified version. In effect, this indicates that there was some requirement that becomes relaxed after the final indicated version.
psqltest from [FirstVersion] to [SecondVersion]
This indicates a test that begins applicability with the first version label indicated, and which ceases to be applicable after the second version label.

Example of mahout.control

# Note that Base is actually version 1.5; that was where we started...
version Base
  ddl Base/base-schema.sql

common tests
  psqltest from 2.2 to 2.2.26 test/parent-test-1.sql
  psqltest from 2.2.26 test/parent-test-2.sql
  psqltest from 2.3y test/pf-currency.sql
  psqltest from 2.2 test/gf.sql
  psqltest from 2.6 test/bigint-checks.sql
  psqltest from 2.4 test/message-queuing.sql
  psqltest from 2.3z test/ropq.sql
  psqltest test/all-tables-commented.sql
  psqltest test/all-functions-commented.sql
  psqltest test/table-names-unique.sql

version 2.0
  requires Base
  superuser 2.0/drop_roles.sql
  ddl 2.0/public.sql
  ddl 2.0/app_private.sql
  ddl 2.0/app.sql
  ddl 2.0/devtools.sql
  psqltest 2.0/new-feature-in-2.0.sql

version 2.1
  requires 2.0
  ddl 2.1/public.sql
  ddl 2.1/app.sql
  ddl 2.1/drop_deprecateds.sql
  ddl 2.1/devtools.sql

version 2.2
  requires 2.1
  ddl 2.2/public.sql
  ddl 2.2/app_private.sql
  ddl 2.2/adjustment_criteria.sql
  ddl 2.2/migrate_ids.sql CLIENT_ID_FILE=${CONFIG_22}

version 2.2.26
  requires 2.2
  ddl 2.2.26/public.sql
  ddl 2.2.26/migrate-ids.sql CLIENT_ID_FILE=${CONFIG_2226}

version 2.3x
  requires 2.2.26
  ddl 2.3x/public.sql
  ddl 2.3x/app.sql
  ddl 2.3x/app_private.sql

version 2.3y
  requires 2.3x
  ddl 2.3y/public.sql
  ddl 2.3y/app_private.sql
  ddl 2.3y/app.sql
  ddl 2.3y/devtools.sql
  unix 2.3y/migration-currency.sh PF_CURRENCY_FILE=${CONFIG_23y}
  unix 2.3y/migrate-pf-currency.sql

version 2.3z
  requires 2.3y
  ddl 2.3z/public.sql
  ddl 2.3z/app_private.sql
  ddl 2.3z/app.sql
  ddl 2.3z/migration-script.sql

version 2.4
  requires 2.3z
  ddl 2.4/app.sql

version 2.5
  requires 2.4
  ddl 2.5/public.sql
  ddl 2.5/app_private.sql
  ddl 2.5/app.sql

version 2.6
  requires 2.5
  unix 2.6/prepare-rapp-schema-upgrade.sh ${CONFIG_26}

version 2.7
  requires 2.6
  ddl 2.7/globals.sql
  ddl 2.7/public.sql
  ddl 2.7/app_private.sql
  ddl 2.7/app.sql

version 2.8
  requires 2.7
  ddl 2.8/public.sql

Slony Specific Parameters

If running Mahout against a Slony cluster, additional configuration is required:

SLONYCLUSTER
This has the name of the Slony Cluster, which is what, with a prepended underscore, becomes the schema name used for Slony’s internal configuration.
SLONYMAINSET
This is the ID of the replication set into which things should be merged after all tables and sequences have been added to replication.
SLONYTEMPSET
This is the ID of a (not-presently-existing) replication set that is to be used for tables and sequences that need to be added.
SLONYOMITTABLES
This is a set of tables that should be omitted from replication, represented as a SQL IN clause.
SLONYOMITSEQUENCES
This is a set of sequences that should be omitted from replication

Things Mahout does not itself do that we do today in Registry Universal Schema

These things need to be covered somewhere.

Permissions Provisioning

The Afilias Universal Schema hs a fairly sophisticated permissions model constructed out of:

  • Matrix of detailed permissions for Registry-Services-managed roles
  • Schema-based (e.g. - granting same permissions for a whole “directory” of objects) for other roles

This does not need to be covered by specific “permissions provisioning” tools; having regression tests that verify that permissions have been provisioned should suffice.

Testing schema for various conditions

There are a set of tests in the universal-schema build today that verify several things. The Mahout psqltest facility should nicely cover these sorts of requirements.

  • Common rules
    • Verify that all functions, tables, columns are commented
    • Verify that all tables have primary keys
    • Verify that table names are unique
    • Verify that tables are owned by a suitable user
      • Ideally, this should not be the database user under which mahout runs, so that we’re not merely getting defaults.
  • Unit tests
    • Grandfathering
    • Message queueing
    • Policy cuts
    • Premium pricing
    • Registry operator poll queue
    • VAT

Automatically generating documentation

Some documentation is automatically generated using:

  • Schema Spy
  • PostgreSQL autodoc

Mahout would not automatically provide this. It could be an interesting extension for Mahout to generate per-version copies of documentation, but it is likely appropriate for this to be separated out and generated separately.

There are a couple of ideas to be had here:

mahout schemadocs
It would be interesting to have a Mahout command that can run Schema Spy and/or autodoc to generate documentation. That seems like an easy extension.
mahout release-notes
It is common to expect to have per-version release notes. For Mahout to support this by having a convention for collecting and generating per-version release notes would be a neat idea.

Under-Organized Requirement Ideas

The upgrade system needs to identify and control all schema changes to be made.

  • [ ] User that runs updates
    • [ ] Usually should be schema owner
    • [ ] Some changes must be handled via superuser
      • Actually, heading to prefer superuser, and have ownership validation rules to ensure everything does not belong to the superuser.
      • Having a special non-superuser for this looks nice in simple environments; with Slony, where changes are generally applied by a superuser, all that goes pretty forcibly out the window.
  • [X] Multiple SQL scripts
    • [X] Need a mechanism to order them
  • [X] Nice to have: ensure DDL and DML do not get done in the same script
    • [X] Can this be verified automatically?
  • [X] Configuration file that indicates
    • [X] Place to log things
    • [X] Postgres binaries
    • [X] PostgreSQL URI
  • [ ] Version number capture
    • [ ] We use stuff in upgrade_version_to_latest.sh to indicate the version in the _oxrsversion schema
    • [ ] Current stuff is:
      [ ] Branch
      which may become the version label
      [ ] Generated on host
      Reasonable for Mahout to do differently
      [ ] Generated at time
      Reasonable for Mahout to do differently
      [ ] SCM checkout information
      Reasonable for Mahout to do differently
  • [-] Log activity
    • [X] For each sub-component, identify what was run, when it ran, how long it took
    • [X] Some may go into database
    • [X] Successes are no problem; failed schema would be troublesome to get into DB as requests would fail
    • [ ] Probably need to determine how to serialize some logs into filesystem
  • [ ] Supplemental configuration
  • [-] Standard tests
    • [ ] Check that functions all have comments
    • [ ] Check that tables and views all have role-based permissions attached to them
    • [ ] Check that tables all have primary keys-
    • [ ] Check that tables have unique names
    • [ ] Check that tables and their columns have comments
    • [X] Hooks to allow running custom tests
    • [X] Which tests to run against which versions?
  • [-] We’ll have several kinds of things to execute…
    • [X] SQL scripts that need to be run
    • [X] Shell scripts that need to be run
    • [ ] Might there be some Python? Unnecessary
  • [X] Various pre- and post-conditions
    • [X] Run pgcmp to check that schema matches expectations, e.g. - captured pgcmp-dump matches the schema
    • [X] Need a mechanism that runs pgcmp-dump to capture schema at various points
    • [X] Run tests, and capture either perfect conformance or counts/details of non-conformance?
  • [ ] Seeding
    • [ ] Security needs
      • [ ] Roles
        • [ ] What to do about the possibility of needful roles evolving over time?
      • [ ] Basic users
        • [ ] Basic information needed will be
          • [ ] Owner
          • [ ] Superuser (hopefully little needed)
  • [ ] Version and tagging model
  • [ ] mahout versions
  • [ ] mahout changes
  • [X] Implement MAHOUTSEARCHPATH
    • if it is set, then add EXECUTE SCRIPT(SQL='set search_path to public, private, ...;')

About

schema management tool for PostgreSQL, using pgcmp to find discrepancies

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages