Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for SQL:2011 temporal validity #4704

Open
5 of 15 tasks
lukaseder opened this issue Oct 31, 2015 · 7 comments
Open
5 of 15 tasks

Add support for SQL:2011 temporal validity #4704

lukaseder opened this issue Oct 31, 2015 · 7 comments

Comments

@lukaseder
Copy link
Member

lukaseder commented Oct 31, 2015

SQL:2011 standardised on bitemporal validity, which is an important topic in many areas where data needs to be historised in two ways:

  • Technically (e.g. like Oracle's flashback query)
  • Semantically (e.g. a validity period of a certain record)

A good paper from IBM illustrates these parts of the SQL:2011 standard:
http://www.sigmod.org/publications/sigmod-record/1209/pdfs/07.industry.kulkarni.pdf

Few databases support these features natively, and if they do (DB2, Oracle, SQL Server), they ship with a vendor-specific syntax.

jOOQ should support all of those syntaxes (Oracle flashback query is already supported) and go beyond by providing emulations for these features:

System versioning (technical)

An additional INTEGER or TIMESTAMP column per table can be used for versioning, similar to the existing optimistic locking versioning scheme in jOOQ. Such columns will be configured in the source code generator and marked specifically in generated code. An additional BOOLEAN "versioned" column might be added too, to simplify identifying the current version of each row.

Impact on queries:

  • SELECT statements must ensure that only the latest version of a row is fetched, unless previous versions are explicitly asked for using AS OF on the org.jooq.Table
  • INSERT statements remain unchanged
  • UPDATE statements will execute as:
    • UPDATE the existing records, mark them as "versioned"
    • INSERT .. SELECT the existing records in order to duplicate them
    • Run the actual UPDATE on the duplicated rows
  • DELETE statements will execute as:
    • UPDATE the existing records, mark them as "versioned"
  • MERGE and other forms of UPSERT are not supported

Semantic versioning

An additional DATE or TIMESTAMP range column per table can be used for semantic versioning to specify the validity of a row during a range. Such columns will be configured in the source code generator and marked specifically in generated code. It is assumed that ranges must not overlap for the same primary key. Typically, the primary key will include an ID and the range's FROM date.

Impact on queries:

  • SELECT statements usually fetch all the ranges, unless a specific period filtering predicate was applied
  • INSERT statements must ensure that no periods overlap. If an insert is made for a period that overlaps with an existing period, the existing period is
    • shortened in the case of a partial overlap
    • split in two in the case of a "contains" overlap
  • UPDATE statements remain unchanged, unless an explicit FOR PORTION OF clause is issued, in case of which the UPDATE statement translates to individual INSERT statements for each overlapping period
  • DELETE statements remain unchanged, unless an explicit FOR PORTION OF clause is issued, in case of which the DELETE statement translates to an INSERT statement for that specific portion and then deletes it again.
  • MERGE and other forms of UPSERT are not supported

Alternative implementations are possible too

In all cases, technical versioning may apply in addition to semantic versioning

Impact on UpdatableRecord

In addition to transforming individual DML statements for explicit use, it will be very useful to have additional generated store(), insert(), update(), delete() methods on UpdatableRecord, which helps executing temporal DML directly on a CRUD level. The semantics of this change yet needs to be defined.

Tasks

A number of improvements need to be made in order to accomplish this feature. These will be listed here, and possibly, extracted to new issues:

See also

@mfoody
Copy link

mfoody commented Nov 11, 2015

Incredibly interested in this feature. I'm currently working on an application backed by Postgres in which virtually all the data is either completely immutable or bitemporal. It has actually been fairly painless using JOOQ compared to previous experience using ORMs on top of bitemporal schemas. We are using Joda time's Intervals in Java (Scala) and Postgresl's range types inside the database which seem to work together perfectly.

A couple questions on your approach,

  • You propose implementing system versioning with an integer or timestamp column. The linked document as well as other sources (for example the Snodgrass book) manage this using a pair of dates. How do you track deletes using a single value? Also for databases without direct support for SQL:2011, a SELECT on a single datefield is going to involve some kind of subquery or join to select a single value.
  • How will the system time be determined when inserting/updating/deleting rows? In our application we define the system time in the application to ensure that all rows updated in a transaction have a consistent system interval.

Should add, thank you for a fantastic product!

@lukaseder
Copy link
Member Author

It has actually been fairly painless using JOOQ compared to previous experience using ORMs

I'd love to learn more about that. Did you intend to write a blog post somewhere, about your experience? That would be very useful!

You propose implementing system versioning with an integer or timestamp column. The linked document as well as other sources (for example the Snodgrass book) manage this using a pair of dates. How do you track deletes using a single value?

In principle, the only valid system version is the latest one (highest timestamp, or sequence number). Deletes would need to be tracked with an additional value, indeed. The reason why I wanted to use this single timestamp / sequence number is because we already have such synthetic columns for optimistic locking. It would be good to reuse existing functionality here.

Also for databases without direct support for SQL:2011, a SELECT on a single datefield is going to involve some kind of subquery or join to select a single value.

Yes, this isn't completely specified yet. If the additional value was a boolean (called "versioned" in the feature request text) indicating whether the record is active (and there can be only one active record per primary key), then there's no problem to find the latest version very quickly.

An entirely different option for system versioning would be to use a secondary table of the same row type. Ultimately, we don't want to enforce one or the other model. Which is why this will be a rather complex feature to implement.

How will the system time be determined when inserting/updating/deleting rows?

I've had the same thoughts. SQL:2011 implementations also use the transaction time (see again the linked document)

Should add, thank you for a fantastic product!

Thank you :)

@mfoody
Copy link

mfoody commented Nov 17, 2015

No blog post yet. Hopefully once we get into production and things calm down a little bit (is that ever how it actually works?).

One last question about your proposed design.

Yes, this isn't completely specified yet. If the additional value was a boolean (called "versioned" in the feature request text) indicating whether the record is active (and there can be only one active record per primary key), then there's no problem to find the latest version very quickly.

It seems like this is going to require two additional fields.

  • A boolean field to track the current version. This would make selecting the current version trivial which is what you usually want but would still make it quite expensive (and tricky) to select as of a specific bitemporal time.
  • A timestamp to track when a delete occurs. If for a given primary key the boolean is always false then you know that latest value was deleted but not when the delete occurred which is usually essential. This timestamp would be null for all non-deleted rows.

If I'm understanding that correctly and you do need the timestamp, I'm not sure the current column adds additional value.

@lukaseder
Copy link
Member Author

@mfoody: Sorry for the delay. You may well be right, I cannot tell yet. We do want to implement this for jOOQ 3.8, but we haven't thought this through yet. It might just be a bit too early to discuss all the details fo the proposal.

@lukaseder
Copy link
Member Author

Not realistic for jOOQ 3.8

@lukaseder
Copy link
Member Author

The full scope of this feature will be reviewed again in jOOQ 3.14, but parts might already be in scope for 3.13, specifically #9741.

@lukaseder
Copy link
Member Author

Basic API support mapping to native implementations has been added for 3.13 through #9741. This includes:

  • FOR clause support in SELECT statements
  • FOR PORTION OF clause support in UPDATE and DELETE statements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

2 participants