Base 91 vs Base 85 #320
Comments
Another option
One option I came across a few minutes ago is base 100 encoding/decoding. The example code in a GitHub repo here looks simpler and possibly more CPU-efficient than base91; it requires no imports at all, not even from the stdlib, and it comes with the Unlicense, which is basically a license without any conditions.

Other
There's also a base122 encoding scheme in JavaScript, meant for HTML web pages, that I'm looking at now. It promises 14.3% overhead vs. 25% for base85 and 22-23% for base91, but it isn't 8-bit-clean, so there would need to be a warning not to use it on UTF-8 characters.

Edit: I made a Python version on GitHub here. Currently it takes about 3x as long as base85 and 10-15x longer than base64 to do a full encode/decode, but I'm still working on improving it. If you're okay with including Cython as a dependency or including precompiled binaries, I can make it just as fast as base85.
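The overhead figures quoted in this thread can be sanity-checked against the two binary-to-text codecs that ship in the Python stdlib (base91 and base122 are not in the stdlib, so only base64 and base85 appear in this sketch):

```python
import base64

# Measure binary-to-text bloat for the two stdlib codecs.
# base64 maps 3 bytes -> 4 chars (~33%); base85 maps 4 bytes -> 5 chars (25%).
data = bytes(range(256)) * 4  # 1 KiB of sample bytes

for name, enc in [("base64", base64.b64encode), ("base85", base64.b85encode)]:
    encoded = enc(data)
    overhead = (len(encoded) - len(data)) / len(data)
    print(f"{name}: {overhead:.1%} overhead")
# → base64: 33.6% overhead
# → base85: 25.0% overhead
```

The claimed 22-23% for base91 and 14.3% for base122 would slot in below base85 on the same scale.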
+1
👍 Thanks for looking into this! You've already implemented this in an opt-in fashion, so it should be very straightforward to accept. I dig it; thanks for scratching away at these performance improvements.

I haven't used Cython much, but if it can gracefully degrade and things will still work in pure-Python mode, then it sounds pretty good. I just want to make sure that existing users who can't use Cython for one reason or another are not negatively impacted. Nicely done! This sounds really great for reducing network IO.

It is getting a little complicated with all these encoding options, though. We try to avoid breaking existing users whenever possible, so the current booleans would probably have to become "deprecated but supported" for a long period of time.
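The graceful degradation asked for above is usually done with a try/except import: prefer the compiled codec when it built successfully, fall back to pure Python otherwise. A minimal sketch, where `_speedups` is a hypothetical Cython extension module name, not an existing jsonpickle module:

```python
import base64

try:
    # Compiled path: only present if the Cython extension built on install.
    from _speedups import fast_b85encode as _b85encode  # hypothetical module
except ImportError:
    # Pure-Python fallback: always available, just slower.
    _b85encode = base64.b85encode

def encode(data: bytes) -> bytes:
    """Encode bytes with whichever base85 implementation is available."""
    return _b85encode(data)
```

Users without a C toolchain simply never hit the compiled branch, so nothing breaks for them.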
Ok, got it. The base 100 idea didn't really pan out, but I got base122 down to about twice as fast as base85 with Cython, and I believe the pure-Python version is still about the same speed as it was in September.

However, shipping the Cython version to PyPI while maintaining compatibility with Windows will be a difficult task. I tried it a while back and couldn't get the hang of it: if we include it on Windows, the downloading machine needs the Windows C++ build tools, and many people don't have those or don't want to bother downloading them.

So in summary, I can refactor the base91 repo to include base122 as an option, and change the encoding option to
I hope you don't mind me responding to a message you asked me to ignore. Either defaulting to
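If the current booleans do get replaced by a single string-valued encoding option, the dispatch could look something like the sketch below. The `encode_bytes` name comes up later in this thread; the codec table and error handling are illustrative, and the base91/base122 entries would come from the external repos under discussion rather than the stdlib:

```python
import base64

# Illustrative codec registry: name -> (encode, decode).
# base91/base122 would be registered here once vendored or imported.
_CODECS = {
    "base64": (base64.b64encode, base64.b64decode),
    "base85": (base64.b85encode, base64.b85decode),
}

def encode_bytes(data: bytes, encoding: str = "base64") -> bytes:
    """Encode data with the named codec, rejecting unknown names early."""
    try:
        enc, _ = _CODECS[encoding]
    except KeyError:
        raise ValueError(f"unknown encoding: {encoding!r}")
    return enc(data)
```

A string parameter keeps the API open-ended, so adding base122 later would not require deprecating yet another boolean.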
Having a section in the docs that describes the pros/cons of each encoding strategy (similar to the performance metrics you shared here) could be very helpful to include for posterity as well. I suspect some use cases will benefit from being able to trade CPU cycles for IO efficiency. Thanks again for taking a look.
Ok, that sounds good; I'll make sure to include that in the documentation. I'm thinking that encode_bytes will default to

Also, I've been looking at how to include wheels for the Cython version so users don't have to build it on their side, and I've found that it's pretty easy to build wheels with https://github.com/joerick/cibuildwheel. The great part is that cibuildwheel supports GitHub Actions, though I'm not entirely sure how one would use Actions in combination with it. If we didn't use cibuildwheel with GitHub Actions, someone would have to build the wheels on a Linux machine, a Windows machine, and a macOS machine. I could do the Linux and Windows wheels, but not the macOS ones; someone else would have to do that.
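For what it's worth, the cibuildwheel docs show a GitHub Actions workflow along these lines (the action versions and job names here are illustrative, not taken from this repo). Each OS in the matrix builds its own wheels on GitHub's runners, which sidesteps the need for anyone to own a macOS machine:

```yaml
name: Build wheels

on: [push, pull_request]

jobs:
  build_wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # cibuildwheel builds wheels for every supported CPython on this OS.
      - name: Build wheels
        run: pipx run cibuildwheel
      - uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: ./wheelhouse/*.whl
```

The resulting wheels can then be downloaded from the workflow run and uploaded to PyPI, so installing users never need the Windows C++ build tools.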
Background:
So, I was going through binary-to-text encodings the other day after being intrigued by base 85. I found that, in addition to base 85 and base 64, there's a base 91 standard that's slightly better than base 85 (it bloats output by 23% vs. 25%). I found a nice repo for base 91 that says it's compatible with Python 2 and 3 here, though I was only able to test on Python 3.8. There's no option for base91 encoding/decoding in the stdlib, but luckily this repo has extremely simple code for it, under 100 lines total. The only downside I can see is that it comes with the same license as jsonpickle, so as long as you're okay with providing the appropriate copyright notice at the top of util.py, it should be fine. I made a fork of jsonpickle with updated code for base 91 encoding here, and tested it with tox.
Performance
Since I'm aware that this base91 implementation may take more CPU in pure Python than if it were implemented in C, I simply added it as an optional argument. Base 85 encoding/decoding is also implemented in pure Python, while base 64 is implemented in C. Base91 bloats output by about 23%, compared to about 25% for base85. Some sample bloat data is described in one GitHub repo here, and in another here.
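The pure-Python vs. C cost difference mentioned above can be measured directly with `timeit`. This sketch compares the two stdlib codecs (in CPython, `b64encode` is C-backed via `binascii` while `b85encode` is pure Python); absolute numbers are machine-dependent, so no expected output is shown:

```python
import base64
import timeit

# Compare C-backed base64 against pure-Python base85 from the stdlib.
data = bytes(range(256)) * 64  # 16 KiB sample payload
n = 200

t64 = timeit.timeit(lambda: base64.b64encode(data), number=n)
t85 = timeit.timeit(lambda: base64.b85encode(data), number=n)
print(f"base64: {t64:.4f}s  base85: {t85:.4f}s  slowdown: {t85 / t64:.1f}x")
```

The same harness, pointed at a base91 or base122 implementation, would give comparable before/after numbers for the Cython work discussed in this thread.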
Rationale
I'm very focused on performance, specifically memory usage, and I'm willing to take whatever gains I can get, however small (23% bloat for base91 vs. 25% for base85). I'm pretty sure there are other people out there looking for similar gains.
I'm holding off on submitting a PR until I can get some feedback.