Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base 91 vs Base 85 #320

Open
Theelx opened this issue Sep 15, 2020 · 7 comments
Open

Base 91 vs Base 85 #320

Theelx opened this issue Sep 15, 2020 · 7 comments

Comments

@Theelx
Copy link
Contributor

Theelx commented Sep 15, 2020

Background:

So, I was going through binary-to-text encoding the other day after I was intrigued by base 85. I found that, in addition to base 85 and base 64, there's a base 91 standard that's slightly better than base 85 (bloats by 23% vs 25%). I found a nice repo for base 91 that says it's compatible with python 2 and 3 here, though I was only able to test on python 3.8. There's no option for base91 encoding/decoding in the stdlib, but luckily this repo has extremely simple code for it, under 100 lines total. The only downside I can see is that it comes with the same license as jsonpickle, so if you're okay with providing the appropriate copyright notice at the top of util.py, it should be fine. I made a fork of jsonpickle with updated code for base 91 encoding here, and tested it with tox.

Performance

Since I'm aware that this base91 implementation may take up more CPU in pure python, than if it was implemented in C, I simply added it as an optional argument. Base 85 encoding/decoding is also implemented in pure python, however base 64 is implemented in C. Base91 bloats by about 23% compared to base85, which bloats about 25%. Some sample bloating data is described in a Github repo here, and another here.

Rationale

I'm very focused on performance, specifically memory usage, and I'm willing to take as many performance gains as I can get, however small (bloating 23% for base91 vs 25% for base85). I'm pretty sure there are other people out there looking for similar gains as I am.

I'm waiting on submitting a PR until I can get some feedback.

@Theelx
Copy link
Contributor Author

Theelx commented Sep 15, 2020

Another Option

One option I came across a few minutes ago was base 100 encoding/decoding. The example code in a GitHub repo here looks simpler and possible more cpu-efficient than base91, it requires no imports at all, not even from the stdlib, and it comes with the "Unlicense", which is basically a license without any conditions. This would likely be more efficient than base91.

Other

There's also a base122 encoding scheme in Javascript that's meant for html webpages that I'm looking at now. It promises a 14.3% overhead vs 25% for base85 and 22-23% for base91, but it isn't 8-bit-clean so there would need to be a warning not to use it on UTF-8 characters. However, as far as I can google, there's no implementation in Python yet. I'm currently making a Python version of it, and I'd be happy to contribute it to jsonpickle when it's done.

Edit: I made a python version on GitHub here. Currently, it takes about 3x as long as base85 and 10-15x longer than base64 to do a full encoding/decoding, however I'm still working on improving it. If you're okay with including Cython as a dependency or including precompiled binaries, I can make it just as fast as base85.

@laike9m
Copy link
Contributor

laike9m commented Nov 18, 2020

+1

@davvid
Copy link
Member

davvid commented Nov 23, 2020

👍 Thanks for looking into this! You've already implemented this in an opt-in fashion so this should be very straightforward to accept. I dig it, thanks for scratching away at these performance improvements.

I haven't used Cython much, but if it can gracefully degrade and things will still work in pure-python mode then it sounds pretty good. I just want to make sure that existing users that can't use cython for one reason or another are not negatively impacted.

Nicely done! This sounds really great for reducing network IO.

It is getting a little complicated with use_base85, use_base91 and later possibly use_base100 and other booleans, so it wouldn't be a bad idea to refactor it to allow something something like encode_bytes='base91' (defaults to None). That way we won't end up with a bunch of booleans with semantically invalid states (eg. when they're all True) but rather a single field to convey the encoding strategy.

We try to avoid breaking existing users whenever possible so the current booleans would probably have to become "deprecated but supported" for a long period of time.

@Theelx
Copy link
Contributor Author

Theelx commented Nov 23, 2020

Ok got it. The base 100 idea didn't really pan out, however I got base122 down to about twice as fast as base85 with Cython, and I believe the pure python version is still about the same speed as it was in September. However, sending the Cython version will PyPi while maintaining compatibility with Windows will be a difficult task, I tried it a while back and couldn't get the hang of it, as if we include it on Windows, the downloading machine needs the Windows C++ build tools, and many people don't have that or want to bother to download it.

So in summary, I can refactor the base91 repo to include base122 as an option, and change the encoding option to encode_bytes='base[64|85|91|122]'. Including the Cython version of base122 will require me to get some advice from people experienced in it, but I think it's doable within a few days. I estimate I can get this ready for a PR within a week.

@Theelx
Copy link
Contributor Author

Theelx commented Nov 23, 2020

Oh also, shouldn't the encode_bytes param default to 'base64'? Or are you thinking something like encoding = encode_bytes or 'base64'?
Ignore this, I missed the last paragraph of your previous message.

@davvid
Copy link
Member

davvid commented Nov 29, 2020

I hope you don't mind me responding to a message you asked for me to ignore. Either defaulting to base64 or using something like encode_bytes or 'base64' in the function body seems sensible. In practice I suspect there will have to be a bit of logic in the function body to retain the existing behavior.

None might provide a way to know that the user hasn't specified anything, so that could be an advantage of using it from the implementation's perspective. The downside, though, is that someone reading the docs is better served by seeing base64 as the default value for the parameter. If in doubt, err on the side of simplicity for the user, future extensibility, and simplicity for the implementation, in that order. A small note such as, The default value for 'encode_bytes' may change in the future, so please specify a value if the choice of encoding is important to your use case might help guide users towards making a choice when needed.

Having a section in the docs that describes the pros/cons of each encoding strategy (similar to the performance metrics you shared here) could be very helpful to include for posterity as well. I suspect some use cases will benefit from being able to trade cpu cycles for IO efficiency.

Thanks again for taking a look.

@Theelx
Copy link
Contributor Author

Theelx commented Nov 29, 2020

Ok, that sounds good, I'll make sure to include that in the documentation. I'm thinking that encode_bytes will default to base64, and if it's set to anything else, use that over the use_base64 and use_base85 switches.

Also, I've been looking at how to include wheels for the Cython version so users don't have to build it on their side, and I've found that it's pretty easy to build wheels with https://github.com/joerick/cibuildwheel. The great part is that that cibuildwheel tool supports GitHub Actions, though I'm not entirely sure how one would use Actions in combination with it. If we weren't to use cibuildwheel with GitHub Actions, it would require someone to build the wheels on a Linux machine, Windows machine, and MacOS machine. I could do Linux and Windows wheels, but not the MacOS, someone else would have to do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants