
Compressed output format (jsongz) for rethinkdb export/import #251

Open
wants to merge 5 commits into base: master

Conversation

@iantocristian commented Feb 19, 2021

Reason for the change
#249

Description
Implemented the feature in the rethinkdb-export and rethinkdb-import scripts: a new export format, jsongz (gzipped JSON).
On the export side, there is a new jsongz writer (based on the json writer implementation) that passes the output through a zlib compressor.
On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read from the gzipped JSON data files directly. There is also an addition to the SourceFile constructor to read the uncompressed size from the gzip trailer.
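(For reference, a minimal sketch of the two mechanisms described above; names and details are illustrative, not the exact code in this diff. zlib emits gzip-framed output when the window bits are offset by 16, and the gzip trailer stores the uncompressed size, ISIZE, in its last four bytes.)

import struct
import zlib

# Illustrative sketch: wbits = 16 + zlib.MAX_WBITS makes zlib emit a
# gzip header and trailer around the compressed stream.
compressor = zlib.compressobj(-1, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
with open("table.jsongz", "wb") as out:
    for chunk in (b"[\n", b'{"id": 1}', b"\n]\n"):
        out.write(compressor.compress(chunk))
    out.write(compressor.flush())

# Reading the uncompressed size back from the gzip trailer: the last
# four bytes are ISIZE, the input size modulo 2**32, little-endian
# (RFC 1952).
with open("table.jsongz", "rb") as f:
    f.seek(-4, 2)  # seek to 4 bytes before EOF
    uncompressed_size = struct.unpack("<I", f.read(4))[0]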

Checklist

References

Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export

Tested with Python 2.7.16 and Python 3.8.5.

@gabor-boros
Member

Hello @iantocristian 👋
First of all, thank you for your contribution here! 🎉

Could you please add some unit/integration tests for this functionality?

@iantocristian
Author

👋 @gabor-boros

I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. Seems like a big job.

gabor-boros previously approved these changes Mar 23, 2021
@gabor-boros
Member

@lsabi could you please double check this?

@lsabi (Contributor) left a comment

All in all it seems good. The style has been maintained, but there are missing tests, which could become a problem.

We could write them at a later stage.

Apart from the comments I've added, it seems OK. @gabor-boros, do you have anything to add? Especially for the comment about the new line.

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1
    elif options.compression_level < 0 or options.compression_level > 9:
Contributor

What if someone passes -1 as an option?
In my opinion it should be changed to elif options.compression_level < -1 or options.compression_level > 9:

Author

My reasoning was: passing -1 is the same as not specifying the compression level; it's not really setting compression_level.
Happy to make the change as suggested, though; it might make it easier to switch between setting and not setting the compression level.

Contributor

I understand your reasoning, but then if someone passes -1, the check options.compression_level is None evaluates to False, and options.compression_level < 0 or options.compression_level > 9 evaluates to True and raises an exception. That is incorrect, since -1 is an acceptable value.

If you have another suggestion on how to handle this situation, feel free to add it. Mine was just one possible way to handle it.

All in all, it's no big deal to support -1, but it could prevent some errors/exceptions, since I assume the default value is an acceptable value.

Author

Updated as suggested.
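(For illustration, the updated check would look roughly like this; the exact error handling is in the diff, and the message below is hypothetical:)

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1  # zlib's default compression level
    elif options.compression_level < -1 or options.compression_level > 9:
        # Hypothetical message; -1 is now accepted explicitly.
        raise ValueError("compression level must be an integer between -1 and 9")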

for item in list(row.keys()):
    if item not in fields:
        del row[item]
if first:
Contributor

This implies that the objects in the JSON array are each on a separate line.

I'm no compression expert, but since the output will be binary and unreadable from a high-level perspective anyway, why not skip the newlines? Objects would be written as follows:
[{...},{...},{...}...{...}]
which would save n + 2 newlines (n - 1 between objects, plus 2 for the first and the last, plus 1 for the trailing newline at EOF). On a huge table this could add up to a considerable number of bytes. @gabor-boros do you have any knowledge about the topic?

Or maybe I'm wrong and the \n characters are used for compression. Let me know, I'm curious now.

Author

The code here replicates what the json export does. The \n characters are not used for compression.

One can still unpack the jsongz file and get the json file inside (it's a standard gzip file), in which case the formatting might help. I considered the gains from removing the \n characters marginal, but you might be right: if you have really small documents, the extra newlines might make a difference.
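(For example, since the file is a standard gzip stream, Python's gzip module can read it back directly; the path below is illustrative:)

import gzip
import json

# Illustrative path; an exported table ends up as <dir>/<db>/<table>.jsongz.
with gzip.open("export/test/users.jsongz", "rb") as f:
    docs = json.loads(f.read().decode("utf-8"))
print("%d documents" % len(docs))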

Author

I should say that the import script includes a custom JSON parser (which I found odd; I'm not sure what the reason for using a custom parser was, performance perhaps?) which might be affected by the lack of newlines (I expect it to cope, though).

Contributor

if you have really small documents, the extra newlines might make a difference

Did you mean big documents? Hehe

Regarding the custom JSON parser, I have no clue why there's a custom one. Probably, when the library was written, there were no parsers that fit the requirements. Nowadays there are tons of high-performance parsers. In order not to break anything, I would keep the custom one for now.

@iantocristian
Author

... there are missing tests, which could become a problem. We could write them at a later stage.

I am on the same page here. We need some tests for the backup and restore functionality, but it feels like a different story; a separate PR is in order.

@iantocristian
Author

iantocristian commented Mar 24, 2021

One other comment I got was related to the jsongz extension I used for the data files: why not json.gz? I used jsongz because splitext can't handle json.gz, and json.gz would have required more code changes elsewhere in the scripts.

The downside of using jsongz is that unpacking is more cumbersome, in most cases requiring the extension to be changed before unpacking (e.g. the gzip -d command won't like the jsongz extension). Any thoughts about this?
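(For context, os.path.splitext only splits off the final suffix, which is why json.gz doesn't round-trip cleanly:)

import os.path

os.path.splitext("test.json.gz")  # -> ('test.json', '.gz'): only ".gz" is stripped
os.path.splitext("test.jsongz")   # -> ('test', '.jsongz')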

@lsabi
Contributor

lsabi commented Mar 24, 2021

I haven't worked on the import or export scripts, but one option is to have a list of supported extensions and try to match the end of the filename against the list.

Another alternative could be a nested mapping in which supported extensions point to further extensions, like

SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True}
}

This way, files ending with json can be decoded immediately, while those ending with gz must have json before the gz, which is then checked. Although I don't know how much work performing the switch would be.
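(A minimal sketch of the first idea, with hypothetical names: match known suffixes, longest first, instead of relying on splitext:)

import os

# Hypothetical list; longest suffixes first so ".json.gz" wins over ".gz".
SUPPORTED_EXTS = (".json.gz", ".jsongz", ".json", ".csv")

def detect_extension(filename):
    name = os.path.basename(filename)
    for ext in SUPPORTED_EXTS:
        if name.endswith(ext):
            return ext
    raise ValueError("unsupported file extension: %s" % filename)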

@lsabi
Contributor

lsabi commented Mar 26, 2021

To me it looks good.

The only open point is the newline, which I don't know whether it's worth removing or not.

We can write tests later and check how much the newlines influence the size of the generated file.

@gabor-boros what do you think? From my side, it can pass.

@iantocristian
Author

The only open point is the newline, which I don't know whether it's worth removing or not.

It's one newline per document, right? Plus another one at the start and another one at the end.

For a table with 1000 documents with an average size of 1 KB, you gain 1002 bytes uncompressed: less than a 0.1% gain.
For a table with 10000 documents with an average size of 200 bytes, you gain 10002 bytes uncompressed: roughly a 0.5% gain.
For anything much larger than 1 KB per document, the gain is negligible.
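(A quick sanity check of those figures; the helper below is hypothetical, not part of the PR:)

# One newline per document, plus one at the start and one at the end
# of the JSON array.
def newline_overhead(n_docs, avg_doc_bytes):
    saved = n_docs + 2
    return saved, 100.0 * saved / (n_docs * avg_doc_bytes)

print(newline_overhead(1000, 1024))  # (1002, ~0.098%)
print(newline_overhead(10000, 200))  # (10002, ~0.5%)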

Not worth it imo.

Another point is that jsongz = gzipped json. So it should be the same output as json, but compressed.

@lsabi
Contributor

lsabi commented Mar 28, 2021

Percentages vary based on the size of the documents. But if you have a table with millions of records, it implies saving MBs of space. Sure, it won't be much in comparison to the total size, but it may make the file easier to fit into memory. I'm not sure there'll be such a big table, though.

Nevertheless, as I said, this can be done at a later stage.

What's your point about jsongz? I don't understand it.

@iantocristian
Author

iantocristian commented Mar 30, 2021

Nevertheless, as I said, this can be done at a later stage.

👍

What's your point about jsongz? I don't understand it.

That it wasn't my intention to change the content that's being dumped, just to compress it.
It could have been an option for the json_writer, but I thought it was less risky to have a separate writer.

@lsabi
Contributor

lsabi commented Mar 30, 2021

Don't worry, we can keep them separate and merge them one day.

@gabor-boros do you have anything to add or any complaints about this PR?

@AlexC

AlexC commented Mar 22, 2024

@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks
