[CEP] Case snapshots #27315

Open
snopoke opened this issue Apr 29, 2020 · 11 comments

@snopoke
Contributor

snopoke commented Apr 29, 2020

Abstract
To allow old form data to be archived from the system without impacting the integrity of the case data model.

Cases are built by successively applying the case transactions extracted from form data. Without ready access to ALL the forms required to make up a case, no case transactions can be rolled back or archived, since the case cannot be correctly rebuilt if there are missing forms.

This CEP proposes the use of periodic case snapshots which allow form data beyond a certain point in time to be removed without impacting the integrity of the case after that point in time.

Motivation
For long-running projects, having access to raw form data becomes less useful as the data ages. Conversely, the cost of keeping the data remains the same since CommCare requires ‘fast’ access to the form data for certain workflows, e.g. case rebuilds due to form archiving.

Without a way to isolate the case data from the full history of the case there is no way to safely remove form data from the system.

Specification

Glossary

data availability cutoff
the point in time beyond which forms are no longer accessible to CommCare as part of the OLTP dataset
data availability window
the period of time from the present up until the data availability cutoff
‘available’ data
data that falls within the data availability window e.g. available case snapshots are all snapshots whose creation date falls within the data availability window
max snapshot age
maximum time interval between snapshots
max snapshot transaction count
maximum number of form transactions that should be contained within a snapshot

Snapshots should only be created for the current state of a case (it would be too costly to generate historical snapshots for all cases). A snapshot must include all the relevant fields necessary to reset the case to the exact state at the time of the snapshot. This should also include the state of any ledgers associated with the case.

The implication of this is that this feature will not be immediately useful since snapshots are generally only useful further back in the timeline. In terms of meeting the goals of this CEP, archival of form data will only be possible once there are historical case snapshots for all active cases up to the data availability cutoff.

The creation of a case snapshot should be triggered by a form transaction for the case. A case that is not receiving new form transactions should not have a snapshot generated even if its last snapshot is older than ‘max snapshot age’.

A snapshot should be created for a case that meets the following criteria:

  • form transactions since last snapshot > “max snapshot transaction count”
    OR
  • last snapshot age > “max snapshot age”
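A minimal sketch of these two criteria in Python, evaluated when a new form transaction arrives for a case. The names `SnapshotState` and `should_create_snapshot` and the two threshold values are hypothetical, chosen only for illustration; the CEP does not fix concrete values, and the handling of a case that has no snapshot yet is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical settings: the CEP names these concepts but not their values.
MAX_SNAPSHOT_AGE = timedelta(days=60)
MAX_SNAPSHOT_TRANSACTION_COUNT = 50


@dataclass
class SnapshotState:
    """What we assume is known about a case when a form transaction arrives."""
    last_snapshot_on: datetime
    form_transactions_since_snapshot: int


def should_create_snapshot(state: SnapshotState, now: datetime) -> bool:
    """Return True if either snapshot criterion is met."""
    too_many_transactions = (
        state.form_transactions_since_snapshot > MAX_SNAPSHOT_TRANSACTION_COUNT
    )
    snapshot_too_old = (now - state.last_snapshot_on) > MAX_SNAPSHOT_AGE
    return too_many_transactions or snapshot_too_old
```

Because the check only runs on form submission, a case that stops receiving forms never gets a new snapshot, which matches the rule stated above.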

A rebuild of a case should refresh any snapshots newer than the date of the rebuild. Refreshing a snapshot can be done as follows:

  • generate the new form XML
  • save the new form metadata and new case transaction (transaction dates should match the original snapshot dates)
  • revoke the old snapshot case transaction

Notation:

  • C = case

  • FX = form X

  • FXx = edited form X

  • SX = snapshot form X

  • SXx = updated snapshot X

    C = F1, F2, S1, F3, S2, F4

Rebuild prior to S1 not permitted since F1 and F2 are beyond the data availability cutoff point.

Rebuild from F3:

  • Closest snapshot = S1
  • Transactions for rebuild: S1, F3a, S2, F4
    • S2 included in this list since it will need to be updated during the rebuilding process

      C = F1, F2, S1, F3a, S2a, F4
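The selection of transactions in this example can be sketched as follows. This is an illustration only: `Txn` and `transactions_for_rebuild` are hypothetical names, and the sketch assumes the case history is already sorted in transaction order. It does not perform the rebuild or the snapshot refresh itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Txn:
    name: str
    is_snapshot: bool = False


def transactions_for_rebuild(history: List[Txn], edited_form: str) -> List[str]:
    """List the transactions to replay when `edited_form` changes.

    Replay starts at the closest snapshot before the edited form and runs to
    the end of the history. Later snapshots stay in the list because they
    need to be refreshed during the rebuild.
    """
    names = [txn.name for txn in history]
    edited_at = names.index(edited_form)
    start = 0  # no earlier snapshot: replay from the first transaction
    for i in range(edited_at - 1, -1, -1):
        if history[i].is_snapshot:
            start = i
            break
    return names[start:]


history = [
    Txn("F1"), Txn("F2"), Txn("S1", is_snapshot=True),
    Txn("F3"), Txn("S2", is_snapshot=True), Txn("F4"),
]
print(transactions_for_rebuild(history, "F3"))
```

For the history above, the result is S1, F3, S2, F4, with F3 standing for the edited form.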

Removal of form data (archiving)

  • Forms archived by date range according to the received_on date
  • The “data availability cutoff” should be stored in a database to allow CommCare to determine what data is ‘available’ and what is not.
  • (optional) The “data availability cutoff” may vary depending on the domain allowing domains to customize the length of availability of their data (possibly at additional cost for extended periods)
  • Impacted models
    • OLTP database
      • forms
        • These records should be updated once the data has been successfully archived. Any information required to retrieve the form from archival should be included in the record.
      • case transactions
        • The state of these records should be updated to indicate that they are archived
      • ledger transactions
        • The state of these records should be updated to indicate that they are archived
    • BlobDB
      • form XML
        • Data should be moved to a lower cost storage
      • form attachments
        • Data should be moved to a lower cost storage
    • Elasticsearch
      • forms
        • Ideally the form index has a date based rollover policy allowing old indexes to be removed completely or else moved to lower cost storage.

Case rebuild check rules

  • Date from which to rebuild the case = D

A case can be rebuilt if:

  • D > data availability cutoff
  • AND
    • The first transaction for the case T
      • T > data availability cutoff
    • OR
    • A case snapshot S exists for the case such that
      • S > data availability cutoff
      • S < D
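These rules can be written as a single predicate. The sketch below is a direct transcription under the assumption that D, T, the snapshot dates and the cutoff are all comparable `datetime` values; the function and parameter names are hypothetical.

```python
from datetime import datetime
from typing import Iterable


def can_rebuild(
    rebuild_from: datetime,           # D
    cutoff: datetime,                 # data availability cutoff
    first_transaction_on: datetime,   # T
    snapshot_dates: Iterable[datetime],
) -> bool:
    """Return True if the case may be rebuilt from `rebuild_from`."""
    if rebuild_from <= cutoff:
        return False
    if first_transaction_on > cutoff:
        # The whole case history is still available.
        return True
    # Otherwise an available snapshot older than D is required.
    return any(cutoff < s < rebuild_from for s in snapshot_dates)
```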

Storage

A case snapshot should be stored as a single form with all the necessary case blocks required to restore the case into the exact state it was when the snapshot was created. This may require some new case block primitives in order to allow setting certain metadata fields on the case.

The form will have the following XMLNS: http://commcarehq.org/case/snapshot

As with normal forms each snapshot will be recorded in the OLTP form table as well as in the OLTP case transaction table. The case transaction will be of type form with an additional type bit set to indicate that it is a snapshot and allow easy filtering.

transaction.type = FORM | SNAPSHOT
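As an illustration of the type bit, here is a small Python sketch using `enum.IntFlag`. The bit values are placeholders and are not CommCare's actual constants; `TransactionType` and `is_snapshot` are hypothetical names.

```python
from enum import IntFlag


class TransactionType(IntFlag):
    # Placeholder bit values, not CommCare's actual constants.
    FORM = 1
    SNAPSHOT = 64


def is_snapshot(transaction_type: int) -> bool:
    """True if the snapshot bit is set, whatever other bits are present."""
    return (transaction_type & TransactionType.SNAPSHOT) == TransactionType.SNAPSHOT


snapshot_type = TransactionType.FORM | TransactionType.SNAPSHOT
```

A snapshot transaction still has the FORM bit set, so existing filters on form transactions keep matching it, while the extra bit allows snapshots to be selected or excluded cheaply.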

Impact on users

  • Form data received before the data availability cutoff
    • may only be accessible on request and will take longer to produce than available data.
    • may not be altered in any way (data cleaning, archiving etc)
  • A form edit / archive / unarchive may only take place if ALL cases touched by the form meet the ‘case rebuild check rules’ described above.

Impact on hosting
This will increase the storage requirements for hosting CommCare by a small margin.

Backwards compatibility
NA

Release Timeline
NA

Open questions and issues
Should we try to avoid the situation where a case that only gets sporadic updates ends up with as many snapshots as forms? This could happen if a case only gets updated once per month and “max_snapshot_age = 1 month”.

C = F1 S1 F2 S2 F3 S3 …

@snopoke snopoke added the CEP CommCare Enhancement Proposal label Apr 29, 2020
@millerdev
Contributor

A rebuild of a case should refresh any snapshots beyond the date of the rebuild

This statement seems ambiguous. Would it be accurate to say

"A rebuild of a case should refresh any snapshots newer than the date of the rebuild."

What operations are involved when a snapshot is "refreshed"? Is it similar to form deprecation where the old snapshot is deprecated and the corresponding case transaction is revoked? Is it correct that the new "refreshed" snapshot will have a case transaction with the same server_date as the old one since case transactions are sorted by that field on case rebuild? If not, will case transaction sorting change in some way?

Rebuild from F3:

What would trigger this operation? F3 was edited?


Why have multiple snapshots per case?

Would it meet the objective to make a case snapshot operation that replaces forms older than a given data availability cutoff date with a snapshot form per updated case? In this scenario there would be at most one snapshot form per case, and it would always be the first/oldest form transaction associated with the case. Given a batch of forms to be snapshotted, once a snapshot form is created for each referenced case, the replaced forms can be safely deleted and/or moved elsewhere for further analysis.

The snapshot process could be set up as a long-running resumable task that operates on batches of forms, where form ids in each batch are upserted into an archived_forms table once all related case snapshots have been created. Decoupled form archival processes may run against the archived_forms table, copying and/or deleting forms as needed.

The snapshot process could be run multiple times, each time with a newer data availability cutoff date. Multiple concurrent runs with different cutoff dates currently have an undefined outcome in my head, but it may be possible to make that safe if needed.

One advantage of this scenario is that there is never a necessity to "refresh" a snapshot after it has been created.

@snopoke
Contributor Author

snopoke commented Apr 30, 2020

"A rebuild of a case should refresh any snapshots newer than the date of the rebuild."

Updated

What operations are involved when a snapshot is "refreshed"?

Since we'd need to maintain the date of the transaction I think this would either involve what you suggest, revoke + create new with correct date, or it would just overwrite the previous blob (2nd option seems a bit more dangerous and less atomic).

What would trigger this operation? F3 was edited?

An edit or a form archive

Why have multiple snapshots per case?

In your scenario the only way to generate the snapshot would be to take the previous snapshot and replay all the new forms on top of it to get the new state. This would need to be done for every single case which is analogous to reprocessing every single form in the timeframe. The reason I didn't go with this option is because of the vast amounts of processing required to do it.

@millerdev
Contributor

In your scenario the only way to generate the snapshot would be to take the previous snapshot and replay all the new forms on top of it to get the new state.

Ah, I see. I thought that was also implied in this sentence from your proposal:

A case snapshot should be stored as a single form with all the necessary case blocks required to restore the case into the exact state it was when the snapshot was created.

But I see now that the case blocks for the snapshot can be generated directly from the case. Thanks for hearing me out and clearing that up.

@snopoke
Contributor Author

snopoke commented May 1, 2020

Note: I'm going to try and update this with some of the details about form archiving since I think it makes sense to think about them together.

@snopoke
Contributor Author

snopoke commented May 1, 2020

Updated:

  • snapshots should include ledger data
  • added section: "Removal of form data (archiving)"
  • added section: "Case rebuild check rules"
  • updated "Impact on users"

@millerdev
Contributor

A form edit / archive / unarchive may only take place if ALL cases touched by the form meet the ‘case rebuild check rules’ described above.

It seems desirable to design the system such that it cannot get into a state where a form can be viewed (in a way that a user may consider performing an edit / archive / unarchive operation on it) that is associated with a case that does not meet the ‘case rebuild check rules’. In other words, it should be impossible to find any form in the system that cannot be edited, for example. Can we achieve this design goal? If not, why?


Use of the word "archive" is potentially problematic because it conflicts with the current "form archive" procedure, which as far as I can tell is completely unrelated to this new kind of form archival.

Example uses in this CEP:

  • New type of archive: "Removal of form data (archiving)" and "Forms archived by date range..."
  • Old type of archive: "A form edit / archive / unarchive may only take place if..."

Consider adopting a different term for what happens to forms beyond the data availability cutoff date.

@sravfeyn
Member

sravfeyn commented May 4, 2020

Should we try to avoid the situation where a case that only gets sporadic updates ends up with as many snapshots as forms? This could happen if a case only gets updated once per month and “max_snapshot_age = 1 month”.

We could get around this if we have max_snapshot_age per form type. But not sure if that complicates other things.

A few questions:

  • How does the case history report work with this change?
  • What process is responsible for generating snapshots? Is this done asynchronously when a form is edited or archived?
  • If it's done only when form is edited or archived by users, won't the likelihood be low since not many forms are edited/archived? Or is there a process that's archiving forms older than X period?

@snopoke
Contributor Author

snopoke commented May 4, 2020

Consider adopting a different term for what happens to forms beyond the data availability cutoff date.

100% I was looking for alternatives but didn't come up with anything. Got any suggestions?

It seems desirable to design the system such that it cannot get into a state where a form can be viewed (in a way that a user may consider performing an edit / archive / unarchive operation on it) that is associated with a case that does not meet the ‘case rebuild check rules’. In other words, it should be impossible to find any form in the system that cannot be edited, for example. Can we achieve this design goal? If not, why?

Yea, I'm not sure about this - I think this needs to be balanced with making data available to the users. I think there are a lot of details that need to be considered before we can actually remove any form data. The focus of this CEP is the case snapshots.

@snopoke
Contributor Author

snopoke commented May 4, 2020

How does the case history report work with this change?

As mentioned in the above comment, I think there are still a lot of things that aren't addressed here with regard to removing form data, which is not the focus of the CEP. As far as the snapshots showing up in case history, I think we could exclude them from reports etc.

What process is responsible for generating snapshots? Is this done asynchronously when a form is edited or archived?

"The creation of a case snapshot should be triggered by a form transaction for the case". i.e. it is triggered during form submission. I think doing it synchronously would make the most sense.

If it's done only when form is edited or archived by users, won't the likelihood be low since not many forms are edited/archived? Or is there a process that's archiving forms older than X period?

There is no process that's removing form data as yet. I think there's still a lot of work to do before we can start that. This CEP is one of the pieces.

@esoergel
Contributor

esoergel commented May 4, 2020

archival of form data will only be possible once there are historical case snapshots for all active cases up to the data availability cutoff.

What about inactive cases? Wouldn't they also need snapshots if their forms are to be archived?


A case can be rebuilt if:
...

Imagine a case with some updates, including some snapshots, but where the most recent snapshot is older than the data availability cutoff

F1, S1, F2, F3, S2, F4, <cutoff>, F5, F6

It seems a boundary condition could occur where a user submits data a day too early to trigger a new snapshot, then their supervisor wants to make a change to that submission. This case won't be available for edit or archive until the next form submission triggers a snapshot, even though the most recent form submission was only a day or so ago. Is my understanding of this correct?

I was assuming that the max snapshot age would be the same as the data availability window, but looking back over your original comment, I see that's not actually stated. In any case, it sounds like the interplay between the two would be crucial in determining what window of data is guaranteed to be fully available.

One approach to this would be to require that max snapshot age be less than half of the data availability window, thereby ensuring that any activity within the max snapshot age is fully available for edit. Another would be to introduce a third concept, a data editability window, defined as data_availability_window - max_snapshot_age.

Alternatively, this data expiration could be approached not as a rolling window, but as a series of horizons. For example, you have full access to data from the past 3 months, but on June 1st, you lose access to data not modified since February, and on July 1st, March, and so on (snapshots would have to be made during the first update in each new month). This sounds harder to work with in code, but might be easier to explain to partners and implementers.


A form edit / archive / unarchive may only take place if ALL cases touched by the form meet the ‘case rebuild check rules’ described above.

Seems like we'd want fast and cheap access in queries to whether a case can be edited or archived. For instance, perhaps a data_availability_date case property (better name TBD) representing either the date of the most recent case snapshot or the date of the first transaction, as appropriate.


Consider adopting a different term for what happens to forms beyond the data availability cutoff date.

100% I was looking for alternatives but didn't come up with anything. Got any suggestions?

I've heard the term "freezing/frozen" used for this sort of thing elsewhere. Eg AWS Glacier, ES frozen indices.

@snopoke
Contributor Author

snopoke commented May 5, 2020

I've heard the term "freezing/frozen" used for this sort of thing elsewhere. Eg AWS Glacier, ES frozen indices.

+1

What about inactive cases? Wouldn't they also need snapshots if their forms are to be archived?

If a case doesn't get modified then successive snapshots would be identical to each other and don't add any benefit.

In terms of max snapshot age and data availability cutoff I had thought they would be of this order of magnitude.

  • max snapshot age = 2 months
  • data availability cutoff = now - 1 year (or somewhere in that ballpark)

The 'freezing' of form data doesn't have to match up exactly with the cutoff. I think it may make sense to have the actual 'freezing' process lag the cutoff by max snapshot age to maximize the likelihood of a snapshot being available. Using a series of horizons is also likely to be how it gets implemented, particularly if we use rollover indexes or similar mechanisms to actually do the 'freezing'.

Seems like we'd want fast and cheap access in queries to whether a case can be edited or archived. For instance, perhaps a data_availability_date case property (better name TBD) representing either the date of the most recent case snapshot or the date of the first transaction, as appropriate.

Editing and archiving aren't very common operations and the rules for allowing it based on the case snapshots are very dependent on the date of the form so I don't think it will work to have a case property storing the date of the last snapshot (though that will likely be useful for the snapshot process itself). When displaying a form a simple query to the cases would allow us to know if it is editable / archivable:

select 1 from case_transaction 
where case_id in (case_ids) and type & $SNAPSHOT = $SNAPSHOT 
and server_date < $form_received_on and server_date > $data_availability_cutoff
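
To try the query out, here is a self-contained sketch that runs an equivalent statement against an in-memory SQLite database. Everything outside the query itself is an assumption made for illustration: the table layout, the bit values, the use of ISO date strings, and the names `demo_connection` and `has_usable_snapshot`. It checks one case at a time, since a form touching several cases would need the check to pass for each of them, and it covers only the snapshot branch of the rebuild check rules.

```python
import sqlite3

# Placeholder bit values, not CommCare's actual constants.
FORM = 1
SNAPSHOT = 64


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway table with one case: form, snapshot, form."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE case_transaction "
        "(case_id TEXT, type INTEGER, server_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO case_transaction VALUES (?, ?, ?)",
        [
            ("c1", FORM, "2019-01-10"),
            ("c1", FORM | SNAPSHOT, "2019-09-01"),
            ("c1", FORM, "2019-12-15"),
        ],
    )
    return conn


def has_usable_snapshot(conn, case_id, form_received_on, cutoff) -> bool:
    """True if a snapshot exists after the cutoff and before the form."""
    row = conn.execute(
        "SELECT 1 FROM case_transaction "
        "WHERE case_id = ? AND (type & ?) = ? "
        "AND server_date < ? AND server_date > ? "
        "LIMIT 1",
        (case_id, SNAPSHOT, SNAPSHOT, form_received_on, cutoff),
    ).fetchone()
    return row is not None
```

ISO-formatted date strings sort in chronological order, which is why plain string comparison is enough in this sketch.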

@snopoke snopoke changed the title Case snapshots [CEP] Case snapshots Jun 5, 2020