
[CEP] Form XML Compression in BlobDB #26813

Open · snopoke opened this issue Mar 9, 2020 · 12 comments
Labels: CEP: done, CEP CommCare Enhancement Proposal

snopoke (Contributor) commented Mar 9, 2020

Abstract

Form XML makes up 70% of the BlobDB data on ICDS. In environments with fewer form attachments this number is likely much higher. This CEP proposes to compress the XML data to reduce the storage requirements of the BlobDB.

Motivation

The form XML is saved to the BlobDB uncompressed and, as stated above, makes up at least 70% of the data in the BlobDB. Basic tests show that compression can give a considerable reduction in storage.

On a sample of 10000 ICDS forms:

  • Average file size before compression: 9579 bytes
  • Average file size after compression (gzip -6): 2031 bytes

Specification

Since this is an isolated use case, the compression and decompression will not be handled by the BlobDB.

This specification allows the compression of the XML to take place in three different places. The purpose is to allow experimentation to determine the best place for compression based on data from actual usage (see the Metrics section below).

Compression

The compression should use gzip since it offers a good compromise between compression ratio and speed. It is also well supported in Python.
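
As a rough illustration (a minimal sketch, not part of the proposal), compressing and decompressing form XML with Python's standard library at the same level used for the sample figures above could look like this:

import gzip

def compress_xml(xml_bytes):
    # Level 6 matches the "gzip -6" setting used for the sample above.
    return gzip.compress(xml_bytes, compresslevel=6)

def decompress_xml(compressed_bytes):
    # Restores the original form XML bytes.
    return gzip.decompress(compressed_bytes)

# Round-trip example with a trivial form body
xml = b"<data><case_id>abc-123</case_id><name>example</name></data>"
assert decompress_xml(compress_xml(xml)) == xml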

Metrics
The following metrics should be added to allow monitoring of this change, both to ensure it is functioning correctly and to provide data to better understand the access patterns.

  • commcare.form_submission.read_compressed

    This metric will give us insight into how many times a compressed XML is read and where (in which process) it is read; see the sketch after this list.

    Tags:

    • source
      • Django view name (get_current_request().resolver_match.view_name)
      • celery task name (get_current_task().name)
      • Django management command name (not sure how to get this)
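
A hedged sketch of how the source tag could be derived and the counter recorded. Here get_current_request and get_current_task are the accessors referenced in the tag list above, and metrics_counter stands in for whichever counter API is used; all three names are assumptions for illustration only.

def _read_source():
    # Prefer the Django view name, then the celery task name; how to name
    # management commands is still an open question (see above).
    request = get_current_request()
    if request is not None and request.resolver_match is not None:
        return request.resolver_match.view_name
    task = get_current_task()
    if task is not None:
        return task.name
    return "unknown"

def record_compressed_read():
    metrics_counter(
        "commcare.form_submission.read_compressed",
        tags={"source": _read_source()},
    )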

Changes to CommCare

This should be implemented within the BlobDB in order to make it transparent to calling code.

  1. Update the BlobDB Metadata to keep track of compressed data
    compressed = BooleanField(null=True)

Question: should this be indexed?
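
For illustration only, the new field could look like the following. If the answer to the indexing question is yes, a partial index over rows that are not yet compressed (an assumption, not part of the proposal) would stay small, since new form XML rows will all be compressed:

from django.db import models

class BlobMeta(models.Model):
    # ...existing fields...
    compressed = models.BooleanField(null=True)

    class Meta:
        indexes = [
            # Optional: makes it cheap to find blobs that still need
            # compression, e.g. for a later backfill of old data.
            models.Index(
                fields=["compressed"],
                name="blobmeta_uncompressed_idx",
                condition=models.Q(compressed=False) | models.Q(compressed__isnull=True),
            ),
        ]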

  2. Update put API to compress data based on the type code

Since we only want to compress form XML at this stage the decision to compress can be based on the type code. Any calls to blobdb.put with type code of CODES.form_xml (2) should compress the content before saving it to the backend.

In addition to compressing the data, the blob metadata must record that the data was compressed by setting the compressed flag to True:

def put(self, content, **blob_meta_args):
    # This logic could be moved to the `MetaDB`
    blob_meta_args['compressed'] = blob_meta_args.get('type_code') == CODES.form_xml
    meta = self.metadb.new(**blob_meta_args)
    ...
    if meta.compressed:
        # consider creating a streamed compression here: https://stackoverflow.com/a/31566082/632517
        content = compress(content)
    ...
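
A minimal sketch of what the compress helper above could look like, buffering the gzipped payload in memory (the linked answer describes a fully streamed variant; given the small average size of form XML, buffering is likely acceptable):

import gzip
import io

def compress(content, compresslevel=6):
    # Accept either bytes or a file-like object and return a readable
    # file-like object containing the gzipped payload.
    data = content if isinstance(content, bytes) else content.read()
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode='wb', compresslevel=compresslevel) as gz:
        gz.write(data)
    buffer.seek(0)
    return buffer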
  3. Update get API to support decompression

Currently the get API only takes the blob key which is used to fetch the blob directly from the backend. In order to support decompression the API will need to be modified to pass in the meta object as well:

def get(self, type_code, key=None, meta=None):
    if type_code == CODES.form_xml and not meta:
        raise ...
    blobstream = ....
    if meta and meta.compressed:
        return GzipFile(fileobj=blobstream, mode='r')
    else:
        return blobstream

Impact on users

No impact on end users is expected.

Impact on hosting

No impact is expected for hosting CommCare. This change should be completely transparent.

Backwards compatibility

Since we will likely not compress old data (at least not initially), the code must support both compressed and uncompressed XML. This requirement makes the change backwards compatible by default.

Release Timeline

End of Q2 2020

Open questions and issues

  • Should the metadb.content_length be the compressed or uncompressed size?
  • Should we update the size function?
snopoke added the CEP CommCare Enhancement Proposal label Mar 9, 2020

snopoke (Contributor, Author) commented Mar 9, 2020

@satyaakam FYI

snopoke changed the title from "[CEP] title" to "[CEP] Form XML Compression in BlobDB" Mar 9, 2020
millerdev (Contributor) commented

This proposal seems like a good idea to me.

i. Compress XML in the xform pillow (preferred)

To reduce the likelihood of needing to decompress the XML during processing we could perform the compression in the pillow as a last step after processing. This would re-save the data back to the BlobDB overwriting the original uncompressed data.

Will doubling the number of writes (save each uncompressed XML blob, and then later save a compressed copy of the same) have any negative effects on the system? i.e., storage pricing, extra network bandwidth requirements, etc.

ii. Compress when saving form submissions during the initial request from mobile

In this case the XML is compressed when saving the form during initial processing. This should be done when saving the form after all processing is complete.

This is my preferred option due to the simplicity of implementation and reasoning about how the system works.

Compressing the XML here will mean that any subsequent reads of the XML, such as in the pillows, will need to decompress it.

I assume we can measure the impact, but I would expect the cost of decompression to be negligible, possibly even less than the cost of transferring a larger object over the wire.

--

Aside: 2KB/form is pretty small. I wonder if we could easily measure the difference in cost (of storage fees and operational performance) between storing XML in a system like S3 vs storing it in PostgreSQL? Hosting fees at ICDS scale may be prohibitively expensive for Postgres, but I imagine there is a point (smaller scale) at which the cost of maintaining a separate storage backend is more expensive and/or complex (a different type of expense), especially for hosting environments that run a local S3-compatible storage service.

snopoke (Contributor, Author) commented Mar 10, 2020

@millerdev thanks for your input. I had not factored in that decompression is much faster and simpler than compression. I think you are correct that the compression should rather happen during the initial request processing, which does make things much simpler.

ctsims (Member) commented Mar 10, 2020

Thanks for looping me in @snopoke .

Two thoughts on this approach:

  1. The vast, vast majority of data in Object Storage is XML objects which are read in the performance-sensitive processes, so it seems a bit odd to me that we wouldn't treat the compression as the default behavior for blobs rather than needing the code to decide explicitly whether to compress as an explicit operation. If we are concerned about, say, the cost of requesting compression of existing large files (like a 200mb ccz) for no benefit, we could just heuristically say "Only compress objects < 500kb", and let the blobdb decide when to do it.

  2. Is there a reason we wouldn't use the blobdb metadata record to provide the context for how the object is stored (gzip compressed, etc) rather than trying to determine it by data? One clear example of a downside to determining by data is that we likely are already storing gzipped data in the blobdb (the ccz files are gzipped containers, right?) so trying to infer whether we should be decompressing inline based on binary headers seems like a bad idea to me.

Storing the notion of whether a blob is compressed has a few other advantages, including that it gives us an easier index to start processing uncompressed blobs to recover storage.

Long-term / moonshot it also gives us a place to store more complex info if we ever want to really get fancy with this. Daniel and I talked a long time ago about concepts like cold storage where we take forms which are effectively archived and collect up, say, 10-100 forms of the same type and compress all of them into one GZIP'd blob (which could get us another 10x compression ratio), and then store an index that says (Blob XYZ, Format: GZIP'd collection, item: ABC.xml). That's a lot easier if we build around the notion that the metadata record is a rich index to an abstract storage, rather than just a pointer.

ctsims (Member) commented Mar 10, 2020

@snopoke Did a quick (very low-N) test of the idea of compressing forms of the same type together in batches. Ran these on growth monitoring forms, which are smaller and compress less well, which is why the starting average is lower. Leaving the numbers here in case they are useful for getting an idea of the trend.

Forms Batched   Average Size (bytes)   Compression Ratio
1               1742.8                 36.13%
2               1255.1                 26.02%
4                920.65                19.09%
10               693.35                14.37%
20               608.3                 12.61%
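
For reference, a rough sketch of how this kind of batch measurement could be reproduced (the list of form XML byte strings is a placeholder; the gzip level matches the earlier figures):

import gzip

def batched_average_size(forms, batch_size, compresslevel=6):
    # forms: list of form XML byte strings of the same form type.
    # Returns the average compressed bytes per form when the forms are
    # concatenated into batches and gzipped together.
    batches = [forms[i:i + batch_size] for i in range(0, len(forms), batch_size)]
    total = sum(len(gzip.compress(b"".join(batch), compresslevel)) for batch in batches)
    return total / len(forms)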

snopoke (Contributor, Author) commented Mar 11, 2020

  1. The vast, vast majority of data in Object Storage is XML objects which are read in the performance-sensitive processes, so it seems a bit odd to me that we wouldn't treat the compression as the default behavior for blobs rather than needing the code to decide explicitly whether to compress as an explicit operation. If we are concerned about, say, the cost of requesting compression of existing large files (like a 200mb ccz) for no benefit, we could just heuristically say "Only compress objects < 500kb", and let the blobdb decide when to do it.

Just to confirm here, you are suggesting we compress ALL data in the BlobDB as a matter of course - or based on some threshold. The main reason I didn't take this route is that we are only talking about form XML, which is one of 17 different types of data we currently store. Granted, it is the most common type by a large factor. We do already decide explicitly to store compressed data in many places in HQ code, e.g. data exports, so this would be in keeping with that approach where it's not the BlobDB's prerogative to decide when compression is necessary. Changing that design is a much larger shift which would require changes to the core BlobDB API.

  2. Is there a reason we wouldn't use the blobdb metadata record to provide the context for how the object is stored (gzip compressed, etc) rather than trying to determine it by data?

I did consider this, but since the current plan is to do the compression outside of the BlobDB it didn't make sense. If we were to use the metadata to determine whether a blob is compressed we'd also have to change the BlobDB API, which currently does not read the metadata when fetching blobs (since it only needs the key field).

One clear example of a downside to determining by data is that we likely are already storing gzipped data in the blobdb (the ccz files are gzipped containers, right?) so trying to infer whether we should be decompressing inline based on binary headers seems like a bad idea to me.

Since this proposal is only considering form XML the decompression would only be applied to code that is fetching those objects and not anything else. It is also not being applied inside the BlobDB.

Storing the notion of whether a blob is compressed has a few other advantages, including that it gives us an easier index to start processing uncompressed blobs to recover storage.

Yes, this is true, though that could also be accomplished by simple date-based filtering. This field would lose value once all XML is compressed (but I guess we could drop it after that). There is already a metadata field in the BlobDB model, but it is an un-indexed text/JSON field so we couldn't use it for this purpose anyway. If we do go ahead and put this in the BlobDB itself then having the metadata field would be essential.

Long-term / moonshot it also gives us a place to store more complex info if we ever want to really get fancy with this. Daniel and I talked a long time ago about concepts like cold storage where we take forms which are effectively archived and collect up, say, 10-100 forms of the same type and compress all of them into one GZIP'd blob (which could get us another 10x compression ratio), and then store an index that says (Blob XYZ, Format: GZIP'd collection, item: ABC.xml). That's a lot easier if we build around the notion that the metadata record is a rich index to an abstract storage, rather than just a pointer.

I don't think this has any bearing on the current proposal. Regardless of how the XML compression is done it would still be possible to do the batching at a later stage. The info required to keep track of batched blobs is very different from a single 'is_compressed' field, so I don't think it's much of a consideration unless we come up with a more flexible concept of blob metadata. My main concern with that is it's over-engineering at this stage.

@millerdev I'd be interested to hear your thoughts on whether this should be implemented as part of the core BlobDB API or kept separate.

ctsims (Member) commented Mar 11, 2020

@snopoke Thanks for the clarifications.

I think my mental model of the blobdb's role is more abstraction-based (i.e., I would have assumed that it was a transparent bits-in-bits-out interface which makes decisions about things like compression in the blobdb layer) than the current implementation. Agreed that my changes would be a bigger shift given that isn't how it's approached currently.

millerdev (Contributor) commented

It does seem advantageous to use a database field to indicate that a blob is compressed rather than inspecting the content for magic bits. The latter seems easier to exploit in surprising ways. Blob metadata is a perfect place to store that information.

On the other hand, as you point out, blob metadata is not always loaded prior to loading the blob itself since currently only a "key" string is needed to load a blob using the blob db API. So we cannot store the compression flag on the metadata record for any blobs that are loaded by key alone.

On the other other hand, SQL form attachments (including form XML) are always loaded via their metadata (I think I'm right about that, but feel free to prove me wrong). We could require that blob types that may be compressed (only CODES.form_xml at first) always be loaded via their metadata record. To enforce this we could change the blobdb.get(...) method to require the blob type code (i.e., blobdb.get(key, type_code)).

I'm leaning toward +1 on adding a compression flag on the blob metadata, even if it is only ever used for CODES.form_xml blobs. In this world, blobmeta.open() returns the same bytes that were passed to blobdb.put() just as it does now, and the blob db will transparently handle compression and decompression based on blob type code and the metadata compression field. blobdb.get(key, type_code) will raise an error if passed a type code that could be compressed.

sravfeyn (Member) commented

Since compression is more expensive than decompression, would we consider the option of mobile itself sending gzip-compressed form XML? That would mean we only need to decompress when reading, and can store the data right away without the overhead of compressing.

ctsims (Member) commented Mar 13, 2020

@sravfeyn long term that could definitely be viable, but the lead/lag time for that kind of performance improvement to make it back upstream to HQ for the big use cases is really, really long. It is also a little tricky to figure out how to establish compatibility without sending up multiple copies of the form (one compressed, one not), so I think it's a pretty high-LOE change for the value it'll create.

snopoke (Contributor, Author) commented Mar 18, 2020

I've updated the CEP to reflect the comments and discussion.

millerdev (Contributor) commented Mar 18, 2020

This is shaping up well. Regarding the get API, it should accept either key and type_code, or meta alone.

def get(self, key=None, type_code=None, *, meta=None):
    key = validate_get_args(key, type_code, meta)
    blobstream = ....
    if meta and meta.compressed:
        return GzipFile(fileobj=blobstream, mode='r')
    return blobstream

def validate_get_args(key, type_code, meta):
    if key is not None or type_code is not None:
        if meta is not None:
            raise ValueError("'key' and 'meta' are mutually exclusive")
        if type_code == CODES.form_xml:
            raise ValueError("form XML must be loaded with 'meta' argument")
        if key is None or type_code is None:
            raise ValueError("'key' must be specified with 'type_code'")
        return key
    if meta is None:
        raise ValueError("'key' and 'type_code' or 'meta' is required")
    return meta.key

Edit: update order of checks in validate_get_args for better error output.
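
A brief usage sketch under this API (variable names and the metadata lookup are illustrative; form XML is the only compressible type code at first):

# Form XML must be fetched via its metadata record so the compressed
# flag is available; the blob db transparently decompresses.
xml_stream = blob_db.get(meta=form_xml_meta)

# Any other blob type can still be fetched by key and type code.
stream = blob_db.get(key=some_key, type_code=some_other_type_code)

# Passing form XML's type code without meta raises a ValueError.
blob_db.get(key=some_key, type_code=CODES.form_xml)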
