String length expressed as byte or character count for bencode #92

trantor · 2021-05-19T15:30:37Z

Hello.

First of all, thanks a lot for the tool.
I am, however, encountering problems when dealing with data encoded with bencode.
It's a problem I've come across time and again and hopefully one you can address.
From what I've seen you've interpreted the string length as the number of bytes the string is encoded as, which should be fine.
I guess because the original spec of the format, if we can call it that, was less than crystal clear about what string length means, many implementations around interpret the string length as the character count, i.e. in Unicode terms the number of code points in the string.
Could you create a variant of the bencode format supported by faq that matches the variant interpretation of string length described above? It would make my life a lot easier dealing with these sorts of systems.

Just as a reference, faq would encode (arguably correctly) the JSON { "a": "à" } as the bencode-d d1:a2:àe, while the variant format would encode it as d1:a1:àe, assuming UTF-8 encoded strings.
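
To make the difference concrete, here is a minimal Python sketch of the two length conventions applied to a single string. The function name `bencode_str` and its `mode` parameter are made up for illustration; they are not part of faq, and UTF-8 payloads are assumed.

```python
def bencode_str(s: str, mode: str = "bytes") -> bytes:
    """Encode one string as a bencode string, assuming a UTF-8 payload.

    mode="bytes"      -> length prefix is the UTF-8 byte count (standard reading)
    mode="codepoints" -> length prefix is the Unicode code point count (variant)
    """
    payload = s.encode("utf-8")
    if mode == "bytes":
        length = len(payload)
    elif mode == "codepoints":
        length = len(s)  # Python's len() on str counts code points
    else:
        raise ValueError(f"unknown mode: {mode}")
    return str(length).encode("ascii") + b":" + payload


# "\u00e0" is the precomposed character à (one code point, two UTF-8 bytes).
print(bencode_str("\u00e0"))                # b'2:\xc3\xa0'
print(bencode_str("\u00e0", "codepoints"))  # b'1:\xc3\xa0'
```

For ASCII-only strings the two modes produce identical output, which is probably why the divergence often goes unnoticed.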

Thanks in advance.


jzelinskie · 2021-05-20T19:00:25Z

Hey, that's a pretty interesting problem that I haven't personally run into, even having worked on a pretty widely deployed bencode implementation (on a completely unrelated project).

Do you know of a way to reliably determine when a file should be read using the variant interpretation and when it should not? Also, do you have any examples of bencode implementations that support this (even if they're in other languages)?

trantor · 2021-05-22T14:41:00Z

Hello @jzelinskie
Well, apart from trying one format and falling back on the other if encoding/decoding fails, I don't think there's a reliable way to distinguish the two. After all, they contain the same data and diverged only because of a different/mistaken interpretation of the format.
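
A rough Python sketch of that fallback idea, under some loud assumptions: every string is valid UTF-8 (real bencode strings may be arbitrary binary), and all names here (`decode`, `decode_with_fallback`, the `mode` values) are hypothetical and not taken from faq or any existing library. It tries the byte-count reading first and only then the code point reading.

```python
def _take_string(data: bytes, pos: int, n: int, mode: str):
    """Return (string, new position) for a string of declared length n."""
    if mode == "bytes":
        end = pos + n
    else:  # "codepoints": step over n UTF-8 encoded code points
        end = pos
        for _ in range(n):
            if end >= len(data):
                raise ValueError("string runs past end of input")
            lead = data[end]
            end += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    if end > len(data):
        raise ValueError("string runs past end of input")
    # UnicodeDecodeError is a subclass of ValueError.
    return data[pos:end].decode("utf-8"), end


def _parse(data: bytes, pos: int, mode: str):
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    c = data[pos:pos + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = _parse(data, pos, mode)
            items.append(item)
        return items, pos + 1
    if c == b"d":  # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = _parse(data, pos, mode)
            if not isinstance(key, str):
                raise ValueError("dictionary keys must be strings")
            value, pos = _parse(data, pos, mode)
            result[key] = value
        return result, pos + 1
    if c.isdigit():  # string: <length>:<payload>
        colon = data.index(b":", pos)
        return _take_string(data, colon + 1, int(data[pos:colon]), mode)
    raise ValueError(f"unexpected byte at offset {pos}")


def decode(data: bytes, mode: str = "bytes"):
    value, pos = _parse(data, 0, mode)
    if pos != len(data):
        raise ValueError("trailing data after top-level value")
    return value


def decode_with_fallback(data: bytes):
    """Try the standard reading first, then the code point variant."""
    for mode in ("bytes", "codepoints"):
        try:
            return decode(data, mode), mode
        except ValueError:
            continue
    raise ValueError("not decodable under either interpretation")


print(decode_with_fallback(b"d1:a2:\xc3\xa0e"))  # decodes in "bytes" mode
print(decode_with_fallback(b"d1:a1:\xc3\xa0e"))  # falls back to "codepoints"
```

This is only a heuristic: ASCII-only input is identical under both readings, and nothing guarantees that a variant-encoded file cannot also happen to parse under the byte-count reading.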

As for a practical example of software using the interpretation I was referring to, you can look at https://github.com/Zimbra/zm-mailbox/blob/develop/common/src/java/com/zimbra/common/util/BEncoding.java for the serialization functions used by the Zimbra Communication Suite in its Java code, i.e. the source of my annoyance ;D .
As for a non-internal implementation dealing with such a variation on the theme, I've used some Perl module to deal with it, but I think it works only thanks to the "flexible" way Perl lets you treat a string scalar variable when you don't specifically force it to be a byte string.

trantor · 2021-08-28T16:04:58Z

Following up on this, my problem ended up being with an implementation that expresses string length as the count of UTF-16 code units used to represent the string. Pretty far removed from the standard implementation, yet it exists.
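
For anyone comparing the three conventions, a small Python sketch (the helper name `length_counts` is just an illustration, not something from faq or from my jq code) shows where they diverge:

```python
def length_counts(s: str):
    """Return (UTF-8 byte count, code point count, UTF-16 code unit count)."""
    utf8_bytes = len(s.encode("utf-8"))
    code_points = len(s)
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    utf16_units = len(s.encode("utf-16-le")) // 2
    return utf8_bytes, code_points, utf16_units


print(length_counts("a"))           # (1, 1, 1)
print(length_counts("\u00e0"))      # (2, 1, 1)  à
print(length_counts("\U0001F600"))  # (4, 1, 2)  an emoji outside the BMP
```

Code points and UTF-16 code units agree for everything inside the Basic Multilingual Plane, so the UTF-16 variant only shows itself on characters that need a surrogate pair.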
In the end, given my urgency and other implementation problems concerning Bencode that I found in faq and reported in #93, I threw myself in the deep end of the pool and wrote a modular Bencode decoder/encoder for jq, implementing alternative string length algorithms. It proved interesting, although pretty mind-wracking (or wrecking even).
To anyone who might need it, the code in question is here.

jzelinskie added the exploratory Research and opinions are needed label May 20, 2021
