String length expressed as byte or character count for bencode #92

Open
trantor opened this issue May 19, 2021 · 3 comments
Labels
exploratory Research and opinions are needed

Comments

@trantor

trantor commented May 19, 2021

Hello.

First of all, thanks a lot for the tool.
I am, however, encountering problems when dealing with data encoded with bencode.
It's a problem I've come across time and again and hopefully one you can address.
From what I've seen, faq interprets the string length as the number of bytes in the encoded string, which should be fine.
However, the original spec of the format, if we can call it that, was, I guess, less than crystal clear about what string length means, so there are many implementations around that interpret it as the character count, i.e. in Unicode terms, the number of code points in the string.
Could you add a variant of the bencode format supported by faq that uses this character-count interpretation of string length? It would make my life a lot easier when dealing with these sorts of systems.

Just as a reference, faq would encode (arguably correctly) the JSON { "a": "à" } as the bencode-d d1:a2:àe, while the variant format would encode it as d1:a1:àe, assuming UTF-8 encoded strings.
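
To make the difference concrete, here is a minimal Python sketch of an encoder with a pluggable string-length function. It is not faq's code; the names `byte_len`, `codepoint_len` and `bencode` are made up for illustration, and the output is returned as text rather than raw bytes purely for readability.

```python
def byte_len(s: str) -> int:
    """Length as the number of UTF-8 bytes (the usual bencode reading)."""
    return len(s.encode("utf-8"))


def codepoint_len(s: str) -> int:
    """Length as the number of Unicode code points (the variant reading)."""
    return len(s)


def bencode(value, strlen=byte_len) -> str:
    """Encode ints, strings, lists and dicts; strlen decides the length prefix."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return f"i{value}e"
    if isinstance(value, str):
        return f"{strlen(value)}:{value}"
    if isinstance(value, list):
        return "l" + "".join(bencode(v, strlen) for v in value) + "e"
    if isinstance(value, dict):
        # Dictionary keys are sorted by their raw byte representation.
        items = sorted(value.items(), key=lambda kv: kv[0].encode("utf-8"))
        return "d" + "".join(
            bencode(k, strlen) + bencode(v, strlen) for k, v in items
        ) + "e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


print(bencode({"a": "à"}))                  # byte-count interpretation
print(bencode({"a": "à"}, codepoint_len))   # character-count interpretation
```

The two calls differ only in the length prefix of the non-ASCII string; for pure ASCII data both interpretations give identical output.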

Thanks in advance.

jzelinskie added the exploratory (Research and opinions are needed) label May 20, 2021
@jzelinskie
Owner

Hey, that's a pretty interesting problem that I haven't personally run into, even having worked on a pretty widely deployed bencode implementation (on a completely unrelated project).

Do you know of a way to reliably determine when a file should be read using the variant interpretation and when it should not? Also, do you have any examples of bencode implementations that support this (even if they're in other languages)?

@trantor
Author

trantor commented May 22, 2021

Hello @jzelinskie
Well, apart from falling back on the variant format (and vice versa) if encoding/decoding with the other one fails, I don't think there's a reliable way to distinguish the two. After all, they contain the same data and they diverged due to a different/mistaken interpretation of the format.
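
The fallback idea could be sketched roughly as below. This is only an illustration under a few assumptions: the input has already been decoded from UTF-8 into text, and all names (`take_string`, `decode`, `decode_with_fallback`, the mode labels) are hypothetical. It remains a heuristic, since nothing in the data itself says which interpretation was intended.

```python
def utf8_units(ch: str) -> int:
    """One character counts as its number of UTF-8 bytes."""
    return len(ch.encode("utf-8"))


def codepoint_units(ch: str) -> int:
    """One character counts as one code point."""
    return 1


def take_string(text, pos, n, units):
    """Consume characters from pos until their total length equals n."""
    end, total = pos, 0
    while total < n:
        if end >= len(text):
            raise ValueError("string runs past end of input")
        total += units(text[end])
        end += 1
    if total != n:
        raise ValueError("declared length ends inside a character")
    return text[pos:end], end


def decode(text, units, pos=0):
    """Return (value, next_position) for the item starting at pos."""
    ch = text[pos]
    if ch == "i":
        end = text.index("e", pos)
        return int(text[pos + 1:end]), end + 1
    if ch == "l":
        pos, items = pos + 1, []
        while text[pos] != "e":
            item, pos = decode(text, units, pos)
            items.append(item)
        return items, pos + 1
    if ch == "d":
        pos, result = pos + 1, {}
        while text[pos] != "e":
            key, pos = decode(text, units, pos)
            result[key], pos = decode(text, units, pos)
        return result, pos + 1
    if ch.isdigit():
        colon = text.index(":", pos)
        return take_string(text, colon + 1, int(text[pos:colon]), units)
    raise ValueError(f"unexpected character {ch!r} at {pos}")


MODES = (("utf8-bytes", utf8_units), ("codepoints", codepoint_units))


def decode_with_fallback(text, modes=MODES):
    """Try each interpretation in order; return (mode_name, value)."""
    for name, units in modes:
        try:
            value, end = decode(text, units)
        except (ValueError, IndexError):
            continue
        if end == len(text):  # the whole input must be consumed
            return name, value
    raise ValueError("input does not parse under any interpretation")


print(decode_with_fallback("d1:a2:àe"))
print(decode_with_fallback("d1:a1:àe"))
```

For ASCII-only input both interpretations agree, so the first mode in the list wins and the reported mode name says nothing about how the data was actually produced.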

As for a practical example of software using the interpretation I was referring to, see https://github.com/Zimbra/zm-mailbox/blob/develop/common/src/java/com/zimbra/common/util/BEncoding.java for the serialization functions used by the Zimbra Collaboration Suite in its Java code, i.e. the source of my annoyance ;D .
As for a non-internal implementation dealing with such a variation on the theme, I've used a Perl module for it, but I think it only works because of the "flexible" way Perl lets you treat a string scalar variable as characters unless you specifically force it to be a byte string.

@trantor
Author

trantor commented Aug 28, 2021

Following up on this, my problem ended up being with an implementation that expresses string length as the count of UTF-16 code units used to represent the string. That is pretty far removed from the standard implementation, yet it exists.
In the end, given my urgency and the other Bencode implementation problems I found in faq and reported in #93, I threw myself in at the deep end of the pool and wrote a modular Bencode decoder/encoder for jq that implements alternative string length algorithms. It proved interesting, although pretty mind-wracking (or wrecking even).
To anyone who might need it, the code in question is here.
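
For comparison, the three length conventions mentioned in this thread can be computed side by side. The following is a small Python sketch, not the jq code referred to above; `utf16_len`, `length_variants` and `bencode_string` are hypothetical helper names. The three counts only all differ for characters outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair.

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units (each unit is 2 bytes in UTF-16LE)."""
    return len(s.encode("utf-16-le")) // 2


def length_variants(s: str) -> dict:
    """Return the string length under each of the three conventions."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "codepoints": len(s),
        "utf16_units": utf16_len(s),
    }


def bencode_string(s: str, strlen) -> str:
    """Encode a single string with the given length function (text output)."""
    return f"{strlen(s)}:{s}"


for sample in ("abc", "à", "\U0001F600"):
    print(repr(sample), length_variants(sample))

# The same string gets a different prefix under the UTF-16 convention.
print(bencode_string("\U0001F600", utf16_len))
```

Java's `String.length()` counts UTF-16 code units, which is one plausible way such a variant can come about in Java code.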
