
Compressed output format (jsongz) for rethinkdb export/import #251

Open
wants to merge 5 commits into base: master

Conversation

@iantocristian commented Feb 19, 2021

Reason for the change
#249

Description
Implemented the feature in the rethinkdb-export and rethinkdb-import scripts: a new export format, jsongz (gzipped JSON).
On the export side, there is a new jsongz writer (based on the json writer implementation) that passes the output through a zlib compressor.
On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read from the gzipped JSON data files directly. There is also an addition to the SourceFile constructor to read the uncompressed size from the gzip trailer.
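(For reference, a minimal sketch of the two mechanisms described above; names and details are illustrative, not the exact code in this diff. zlib emits gzip-framed output when the window bits are offset by 16, and the gzip trailer stores the uncompressed size, ISIZE, in its last four bytes.)

import struct
import zlib

# Illustrative sketch: wbits = 16 + zlib.MAX_WBITS makes zlib emit a
# gzip header and trailer around the compressed stream.
compressor = zlib.compressobj(-1, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
with open("table.jsongz", "wb") as out:
    for chunk in (b"[\n", b'{"id": 1}', b"\n]\n"):
        out.write(compressor.compress(chunk))
    out.write(compressor.flush())

# Reading the uncompressed size back from the gzip trailer: the last
# four bytes are ISIZE, the input size modulo 2**32, little-endian
# (RFC 1952).
with open("table.jsongz", "rb") as f:
    f.seek(-4, 2)  # seek to 4 bytes before EOF
    uncompressed_size = struct.unpack("<I", f.read(4))[0]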

Checklist

References

Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export

Tested with Python 2.7.16 and Python 3.8.5.

@gabor-boros
Member

Hello @iantocristian 👋
First of all, thank you for your contribution here! 🎉

Could you please add some unit/integration tests for this functionality?

@iantocristian
Author

👋 @gabor-boros

I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. Seems like a big job.

gabor-boros previously approved these changes Mar 23, 2021
@gabor-boros
Member

@lsabi could you please double check this?

@lsabi (Contributor) left a comment

All in all it seems good. The style has been maintained, but there are missing tests, which could become a problem.

We could write them at a later stage.

Apart from the comments I've added, it seems OK. @gabor-boros, do you have anything to add? Especially for the comment about the new line.

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1
    elif options.compression_level < 0 or options.compression_level > 9:
Contributor

What if someone passes -1 as an option?
In my opinion it should be changed to elif options.compression_level < -1 or options.compression_level > 9:

Author

My reasoning was: passing -1 is the same as not specifying the compression level; it's not really setting compression_level.
Happy to make the change as suggested, though; it might make it easier to switch between setting and not setting the compression level.

Contributor

I understand your reasoning, but then if someone passes -1, the check options.compression_level is None evaluates to False, and options.compression_level < 0 or options.compression_level > 9 evaluates to True and raises an exception. That is incorrect, since -1 is an acceptable value.

If you have another suggestion on how to handle this situation, feel free to add it. Mine was just one possible way to handle it.

All in all, it's no big deal to support -1, but it could prevent some errors/exceptions, since I assume the default value is an acceptable value.

Author

Updated as suggested.
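(For illustration, the updated check would look roughly like this; the exact error handling is in the diff, and the message below is hypothetical:)

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1  # zlib's default compression level
    elif options.compression_level < -1 or options.compression_level > 9:
        # Hypothetical message; -1 is now accepted explicitly.
        raise ValueError("compression level must be an integer between -1 and 9")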

for item in list(row.keys()):
    if item not in fields:
        del row[item]
if first:
Contributor

This implies that the objects in the JSON array are each on a separate line.

I'm no compression expert, but since the output will be binary and unreadable from a high-level perspective anyway, why not skip the newlines? Objects would be written as follows:
[{...},{...},{...}...{...}]
which would save n + 2 newlines (n - 1 between objects, plus 2 for the first and the last, plus 1 for the trailing newline at EOF). On a huge table this could add up to a considerable number of bytes. @gabor-boros do you have any knowledge about the topic?

Or maybe I'm wrong and the \n characters are used for compression. Let me know, I'm curious now.

Author

The code here replicates what the json export does. The \n characters are not used for compression.

One can still unpack the jsongz file and get the json file inside (it's a standard gzip file), in which case the formatting might help. I considered the gains from removing the \n characters marginal, but you might be right: if you have really small documents, the extra newlines might make a difference.
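(For example, since the file is a standard gzip stream, Python's gzip module can read it back directly; the path below is illustrative:)

import gzip
import json

# Illustrative path; an exported table ends up as <dir>/<db>/<table>.jsongz.
with gzip.open("export/test/users.jsongz", "rb") as f:
    docs = json.loads(f.read().decode("utf-8"))
print("%d documents" % len(docs))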

Author

I should say that the import script includes a custom JSON parser (which I found odd; I'm not sure what the reason for using a custom parser was, performance perhaps?) which might be affected by the lack of newlines (I expect it to cope, though).

Contributor

if you have really small documents, the extra newlines might make a difference

Did you mean big documents? Hehe

Regarding the custom JSON parser, I have no clue why there's a custom one. Probably, when the library was written, there were no parsers that fit the requirements. Nowadays there are tons of high-performance parsers. In order not to break anything, I would keep the custom one for now.

@iantocristian
Author

... there are missing tests, which could become a problem. We could write them at a later stage.

I am on the same page here. We need some tests for the backup and restore functionality, but it feels like a different story; a separate PR is in order.

@iantocristian
Author

iantocristian commented Mar 24, 2021

One other comment I got was related to the jsongz extension I used for the data files: why not json.gz? I used jsongz because splitext can't handle json.gz, and json.gz would have required more code changes elsewhere in the scripts.

The downside of using jsongz is that unpacking is more cumbersome, in most cases requiring the extension to be changed before unpacking (e.g. the gzip -d command won't like the jsongz extension). Any thoughts about this?
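(For context, os.path.splitext only splits off the final suffix, which is why json.gz doesn't round-trip cleanly:)

import os.path

os.path.splitext("test.json.gz")  # -> ('test.json', '.gz'): only ".gz" is stripped
os.path.splitext("test.jsongz")   # -> ('test', '.jsongz')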

@lsabi
Contributor

lsabi commented Mar 24, 2021

I haven't worked on the import or export scripts, but one option is to have a list of supported extensions and try to match the end of the filename against the list.

Another alternative could be a nested mapping in which supported extensions point to further extensions, like

SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True}
}

This way, files ending with json can be decoded immediately, while those ending with gz must have json before the gz, which is then checked. Although I don't know how much work performing the switch would be.
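(A minimal sketch of the first idea, with hypothetical names: match known suffixes, longest first, instead of relying on splitext:)

import os

# Hypothetical list; longest suffixes first so ".json.gz" wins over ".gz".
SUPPORTED_EXTS = (".json.gz", ".jsongz", ".json", ".csv")

def detect_extension(filename):
    name = os.path.basename(filename)
    for ext in SUPPORTED_EXTS:
        if name.endswith(ext):
            return ext
    raise ValueError("unsupported file extension: %s" % filename)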

@lsabi
Contributor

lsabi commented Mar 26, 2021

To me it looks good.

The only open point is the newline, which I don't know whether it's worth removing or not.

We can write tests later and check how much the newlines influence the size of the generated file.

@gabor-boros what do you think? From my side, it can pass.

@iantocristian
Author

The only open point is the newline, which I don't know whether it's worth removing or not.

It's one newline per document, right? Plus another one at the start and another one at the end.

For a table with 1000 documents with an average size of 1 KB, you gain 1002 bytes uncompressed: less than a 0.1% gain.
For a table with 10000 documents with an average size of 200 bytes, you gain 10002 bytes uncompressed: roughly a 0.5% gain.
For anything much larger than 1 KB per document, the gain is negligible.
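(A quick sanity check of those figures; the helper below is hypothetical, not part of the PR:)

# One newline per document, plus one at the start and one at the end
# of the JSON array.
def newline_overhead(n_docs, avg_doc_bytes):
    saved = n_docs + 2
    return saved, 100.0 * saved / (n_docs * avg_doc_bytes)

print(newline_overhead(1000, 1024))  # (1002, ~0.098%)
print(newline_overhead(10000, 200))  # (10002, ~0.5%)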

Not worth it imo.

Another point is that jsongz = gzipped json. So it should be the same output as json, but compressed.

@lsabi
Contributor

lsabi commented Mar 28, 2021

Percentages vary based on the size of the documents. But if you have a table with millions of records, it implies saving MBs of space. Sure, it won't be much in comparison to the total size, but it may make the file easier to fit into memory. I'm not sure there'll be such a big table, though.

Nevertheless, as I said, this can be done at a later stage.

What's your point about jsongz? I don't understand it.

@iantocristian
Author

iantocristian commented Mar 30, 2021

Nevertheless, as I said, this can be done at a later stage.

👍

What's your point about jsongz? I don't understand it.

That it wasn't my intention to change the content that's being dumped, just to compress it.
It could have been an option for the json_writer, but I thought it was less risky to have a separate writer.

@lsabi
Contributor

lsabi commented Mar 30, 2021

Don't worry, we can keep them separate and merge them one day.

@gabor-boros do you have anything to add or any complaints about this PR?

@AlexC

AlexC commented Mar 22, 2024

@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks
