Why is compressed JSON usually smaller than compressed Msgpack? #328
(Wild speculation follows.) I've not investigated this with practical data, but your results seem relatively intuitive to me, for the following reasons.

Generic compression algorithms like those you tried benefit from a skewed probability distribution, with some sequences that appear very frequently and others that appear rarely: the algorithm then selects a very short encoding for the frequent symbols at the expense of a longer encoding for the rarer ones. A typical (minified) JSON document has lots of highly repetitive structural bytes: the same few delimiter characters (`{`, `}`, `[`, `]`, `"`, `,`, `:`) appear over and over.

On the other hand, MessagePack is in some sense domain-specific compression, and so the problem here could be similar to trying to compress already-compressed data. Its encoding format does use shorter encodings for more common elements, such as packing small numbers into single bytes and minimizing the overhead of short arrays and maps. But I suspect that means there's a worse distribution of high-frequency vs. low-frequency sequences for a general compression algorithm to benefit from. For example, there are 18 distinct "delimiters" for arrays of different sizes in MessagePack, and so I would guess (I haven't measured) that the probability of any one of those in a typical document is lower than that of JSON's more repetitive delimiters. A compression algorithm can still chew on things like repetitive map keys, but unless the strings are full of escape sequences (unlikely for object attributes, I think) their contents have equal size in both MessagePack and JSON.

Without carefully studying the output of the compressors for your real input I can only speculate, of course. Truly answering this question would, I think, require taking equivalent structures in both JSON and MessagePack, compressing them both with the same algorithm, and then studying the resulting compressed encoding in detail to understand exactly what tradeoffs the compressor made.
I don't really feel motivated to do that sort of analysis, since my intuition already agrees with your result. 😀 MessagePack should typically still "win" in situations where using compression is infeasible for some reason, but as you implied those situations are becoming fewer over time. 🤷‍♂️
This is convincing. Here is my understanding. Take MsgPack's official demo as an example:
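(The demo itself did not survive in this copy of the thread.) The document on msgpack.org's front page is `{"compact":true,"schema":0}`; here is a small stdlib-only sketch of the two encodings side by side, hand-assembling the MessagePack bytes from the format spec so no msgpack package is needed:

```python
import json

doc = {"compact": True, "schema": 0}
json_bytes = json.dumps(doc, separators=(",", ":")).encode()

# Hand-assembled MessagePack encoding of the same document:
mp_bytes = bytes([0x82])                # fixmap with 2 key/value pairs
mp_bytes += bytes([0xA7]) + b"compact"  # fixstr of length 7
mp_bytes += bytes([0xC3])               # true
mp_bytes += bytes([0xA6]) + b"schema"   # fixstr of length 6
mp_bytes += bytes([0x00])               # positive fixint 0

print(len(json_bytes), len(mp_bytes))   # 27 18
```

Uncompressed, MessagePack wins 18 bytes to 27: every `"`, `:`, `,` delimiter is folded into a single type/length byte. But those folded bytes are exactly the highly repetitive ones a general compressor would have encoded cheaply anyway.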
I think this is the point. It indicates that length-prefixed serialization formats are indeed less friendly to compressors than delimiter-separated ones. Your insights inspire me a lot; thank you very much!
So I personally conclude as follows:
I would like to add to this that:
Honestly, the third point makes MessagePack my default, simply because it's far more type-safe without significant work. I'm not saying JSON can't also be type-safe, but it's harder to make auto-converters for Sets, Maps, or BigInts. That alone is enough for me.
In my experience, when people complain about size differences between JSON and MessagePack, 99% of the time it's because the data is full of real numbers that get encoded as MessagePack doubles. They end up being 9 bytes in the MessagePack, as opposed to a decimal expansion in the JSON. (I don't know why the above wild speculation doesn't address this. It's always the doubles.)

You mentioned this is OpenAI embeddings data. Is it this? Sure looks like a lot of real numbers there. Can you attach a JSON response here so we don't need an API key to test this?

All of the compression algorithms you mentioned work in the domain of bytes, so they don't do a good job of compressing floats. A decimal expansion near zero with a prefix like "-0.00" and 14 decimal digits is probably more compressible than a double that has 8 incompressible bytes plus the tag prefix.
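A minimal stdlib-only sketch of this effect, with synthetic embedding-like values rather than real API output, and the MessagePack float64 encoding (a `0xCB` tag plus 8 big-endian IEEE 754 bytes) hand-assembled so no msgpack package is needed:

```python
import json
import random
import struct
import zlib

random.seed(0)
values = [random.gauss(0.0, 0.02) for _ in range(1000)]  # embedding-like reals

json_bytes = json.dumps(values).encode()

# MessagePack encodes each double as a 0xCB tag byte plus 8 big-endian
# IEEE 754 bytes: 9 bytes per value, with a near-random mantissa that
# byte-oriented compressors can do little with.
mp_floats = b"".join(b"\xcb" + struct.pack(">d", v) for v in values)

print(len(json_bytes), len(mp_floats))       # raw sizes
print(len(zlib.compress(json_bytes, 9)),
      len(zlib.compress(mp_floats, 9)))      # compressed sizes
```

The raw MessagePack payload is much smaller (9,000 bytes for 1,000 doubles), but the JSON text is drawn from a tiny alphabet of digits, signs, and dots, so compression closes much of the gap; on real data with a fixed decimal precision it can overtake MessagePack entirely.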
I benchmarked JSON and Msgpack on some real data, and it shows that after compression, data encoded by msgpack is always larger than JSON, although raw msgpack is smaller than raw JSON. I've tested brotli, lzma, and blosc in Python.
There is a common use case, and here is a reproducible example:
I wonder why, and I am thinking maybe it is not worthwhile to use Msgpack in web responses (because almost every browser supports compression nowadays)? No offence, I was a big fan of Msgpack and used to use it everywhere.
I found this already discussed in #203, but I've also tested msgpack on string data (like OpenAI's chat completion responses), and compressed JSON is still a bit smaller. I am confused. Isn't length-prefixed data better than delimiter-separated data?
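To make the length-prefixed vs. delimiter-separated contrast concrete, here is a stdlib-only sketch with a made-up word list, the MessagePack bytes again hand-assembled from the format spec. In JSON every string is framed by the same `"` and `,` bytes; MessagePack replaces them with a single length-prefix byte (`0xA0 | len` for short strings) that changes from string to string:

```python
import json
import zlib

words = ["id", "name", "email", "created_at", "status"] * 200  # 1000 strings

json_bytes = json.dumps(words, separators=(",", ":")).encode()

# MessagePack: array16 header (0xDC + 2-byte count), then each string as
# a fixstr whose single prefix byte is 0xA0 | length -- so the framing
# byte varies with string length instead of repeating like '"' and ','.
mp_bytes = b"\xdc" + (1000).to_bytes(2, "big") + b"".join(
    bytes([0xA0 | len(w)]) + w.encode() for w in words
)

print(len(json_bytes), len(mp_bytes))        # raw sizes
print(len(zlib.compress(json_bytes, 9)),
      len(zlib.compress(mp_bytes, 9)))       # compressed sizes
```

Raw MessagePack is smaller (one framing byte per string instead of three), but the string contents themselves are byte-identical in both formats, so the compressor erases most of the framing advantage; what remains differs mainly in how predictable the framing bytes are.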