String Kind as Unicode #203
Comments
Marking this as a good contribution for someone to attempt to make, although there are landmines all through this topic. Doc here has the key content: https://github.com/ipld/ipld/blob/master/docs/data-model/kinds.md#string-kind

Some of the landmines are present in that doc, but I think the framing that @Stebalien is proposing here focusing on "text" and then "unicode". We focus on UTF-8 in the current doc, because that's really the dominant binary encoding form of Unicode characters, but maybe, as @Stebalien is suggesting, that's getting far too into the codec-weeds when the topic is really about the data model, which isn't about encoding. I'm still fine mentioning UTF-8, simply because it's going to be the most common encoding format and is a useful reference point (e.g. CBOR specifically wants UTF-8 for major type 3, and that fact is at the heart of some of our recent dramas around our sloppy Go dag-cbor implementations that don't care what the conversion from a

@vmx any input on this?
I'd be happy if strings in the IPLD Data Model were described as a list of Unicode characters. The data model really is something abstract. Even if, e.g., your codec is using UTF-8, your implementation might still represent an IPLD Data Model String internally as UTF-16 (I'm thinking of e.g. Java here). But that really is not the concern of the data model. I'd then probably add a sentence (somewhere, perhaps in a doc about codecs) saying that the exact encoding depends on the codec, and that in the case of DAG-CBOR, DAG-PB and DAG-JSON a string is encoded as UTF-8, because those underlying formats demand it in their specifications.
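To make the "abstract string vs. codec encoding" point concrete, here is a minimal Python sketch. It is only an illustration, not part of any IPLD library: the sample text and variable names are made up, and UTF-16 stands in for whatever internal representation an implementation might pick.

```python
# The abstract string: a sequence of Unicode code points.
# (Illustrative value; "\u00e9" is the single character e-acute.)
text = "h\u00e9llo"

# Two different byte-level encodings of the same abstract string.
utf8_bytes = text.encode("utf-8")       # what a UTF-8 based codec would write
utf16_bytes = text.encode("utf-16-le")  # what a UTF-16 based runtime might hold

# The byte sequences differ in content and in length...
assert utf8_bytes != utf16_bytes

# ...but both decode back to the same sequence of code points.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
```

Under this framing the data model would only describe `text`; which of the two byte sequences ends up on the wire or in memory is a codec or implementation detail.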
@rvagg this doesn't seem like a Overall, if we're going to restrict the scope of the data model rather than expand it I'd want to understand:
In particular, if we switch the data model to require String == Unicode (so that it can't be arbitrary bytes), then IIUC this ends up insisting that map keys can only be Unicode. (The docs website isn't explicit about this in the data model section, but it is in the schema section, where it describes schemas as a superset of the data model.) That then a) starts to cause problems with code people have already written against the existing data model, and b) prevents us from representing non-Unicode map keys, which people on occasion have use for and, IIUC, have already made use of.
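As a rough illustration of what a "String == Unicode" rule would exclude, the following Python sketch checks whether a raw key is well-formed UTF-8. The helper `is_unicode_text` and both sample keys are hypothetical and exist only for this example; they are not taken from any IPLD implementation.

```python
def is_unicode_text(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8, i.e. decodable to Unicode text.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A key that is ordinary text, encoded as UTF-8.
text_key = "name".encode("utf-8")

# A key made of arbitrary bytes; the byte 0xFF never appears in valid UTF-8.
binary_key = bytes([0xFF, 0xFE, 0x00])

assert is_unicode_text(text_key)
assert not is_unicode_text(binary_key)
```

Existing data that uses keys like `binary_key` is the kind of case the concern above is about: a Unicode-only String kind would have no way to represent it as a map key.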
Still serious about the Otherwise, this becomes a very low priority for us because we're already weary of this discussion and don't have the bandwidth to get a PR over the line at the moment anyway. It's less than a P3.
The "String" kind should be defined as "text". Specifically, any text that can be mapped to Unicode.
Strings are not for byte-like things. Strings are not (as much as Go would like them to be) "immutable byte strings". Strings are text. They don't have to be Unicode, but Unicode is designed to be the superset of all text-things anyone might want to write, so it's a good starting point.
It's really important to distinguish between the concept of bytes and the concept of text: the string "foo" is composed of a sequence of symbols f, o, o. The underlying encoding is up to the implementation.
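The bytes-versus-text distinction can be seen directly in a language that keeps the two as separate types. The Python sketch below is only an illustration of that idea (the variable names are made up): iterating over text yields symbols, while iterating over bytes yields integers.

```python
text = "foo"               # text: a sequence of symbols
raw = text.encode("utf-8")  # bytes: one possible encoding of that text

# Text and bytes are distinct types.
assert isinstance(text, str) and not isinstance(text, bytes)
assert isinstance(raw, bytes) and not isinstance(raw, str)

# Iterating over text yields symbols...
symbols = list(text)

# ...while iterating over bytes yields integers in the range 0..255.
byte_values = list(raw)

assert symbols == ["f", "o", "o"]
assert byte_values == [102, 111, 111]
```

Go's `string` type, by contrast, can hold arbitrary bytes, which is the "immutable byte strings" behaviour the issue argues the data model's String kind should not inherit.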