String Kind as Unicode #203

Stebalien · 2022-04-05T21:04:56Z

The "String" kind should be defined as "text": specifically, any text that can be mapped to Unicode.

It should NOT even mention encoding. String encoding is defined by the codec, not the IPLD data model.
It should NOT allow arbitrary byte "strings". We have a "Bytes" kind for that.

Strings are not for byte-like things. Strings are not (as much as Go would like them to be) "immutable byte strings". Strings are text. They don't have to be Unicode, but Unicode is designed to be the superset of all text-things anyone might want to write, so it's a good starting point.

It's really important to distinguish between the concept of bytes and the concept of text: the string "foo" is composed of a sequence of symbols f, o, o. The underlying encoding is up to the implementation.
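To make the text/bytes distinction concrete, here is a minimal Python sketch (not taken from any IPLD implementation). It treats "foo" as a sequence of code points and shows that the byte form only appears once an encoding is chosen. The three encodings are picked purely for illustration.

```python
# The text "foo" as a sequence of symbols (Unicode code points).
text = "foo"
code_points = [ord(ch) for ch in text]  # [102, 111, 111]

# Three encodings of the same text yield three different byte sequences.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Decoding with the matching encoding always recovers the same text.
    assert raw.decode(name) == text
    print(f"{name:10} {raw.hex(' ')}")
```

The symbols stay the same in every case; only the bytes differ, which is why the byte layout is a codec or implementation concern rather than a data model one.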


rvagg · 2022-04-20T07:10:11Z

Marking this as a good contribution for someone to attempt to make, although there are landmines all through this topic.

Doc here has the key content: https://github.com/ipld/ipld/blob/master/docs/data-model/kinds.md#string-kind

Some of the landmines are present in that doc, but I think the framing that @Stebalien is proposing here, focusing on "text" and then "unicode", is a good one. We focus on UTF-8 in the current doc because that's really the dominant binary encoding form of unicode characters, but maybe, as @Stebalien is suggesting, that's getting far too into the codec-weeds when the topic is really about the data model, which isn't about encoding. I'm still fine mentioning UTF-8, simply because it's going to be the most common encoding format and is a useful reference point. For example, CBOR specifically wants UTF-8 for major type 3, and that fact is at the heart of some of our recent dramas around our sloppy Go dag-cbor implementations that don't care what the conversion from a string to []byte does, so it is useful to talk about. But yeah, the focus of that doc should be what's going on in-memory and in the programmatic interfaces to the data model, not the encoded forms of the data.

@vmx any input on this?

vmx · 2022-04-20T12:43:38Z

I'd be happy if strings in the IPLD Data Model were described as a list of Unicode characters. The data model really is something abstract. Even if e.g. your codec is using UTF-8, your implementation might still internally represent an IPLD Data Model String as UTF-16 (I'm thinking of e.g. Java here). But that really is not the concern of the data model.
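A small Python sketch of why "list of Unicode characters" and "internal representation" are separate questions: the same two-character text has a different size depending on whether you count code points, UTF-8 bytes, or UTF-16 code units (the unit a UTF-16-based runtime such as Java counts in). The sample text is arbitrary.

```python
text = "a\U0001F600"  # 'a' followed by U+1F600, which lies outside the BMP

code_points = len(text)                           # 2
utf8_bytes = len(text.encode("utf-8"))            # 1 + 4 = 5
utf16_units = len(text.encode("utf-16-le")) // 2  # 1 + 2 (surrogate pair) = 3

print(code_points, utf8_bytes, utf16_units)
```

All three numbers describe the same abstract string, so none of them belongs in a data model definition that is meant to be encoding-independent.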

I'd then probably add a sentence (somewhere, perhaps in a doc about codecs) that the exact encoding depends on the codec, and in case of DAG-CBOR, DAG-PB and DAG-JSON it means that a string is encoded as UTF-8 as those underlying formats demand it in their specifications.

aschmahmann · 2022-04-20T17:38:24Z

@rvagg this doesn't seem like a good-first-contribution kind of thing, but rather something that requires some figuring out from people who have more context (unless the idea is just to replace every instance of UTF-8 with Unicode, but leave the general text around allowing for bytes).

Overall, if we're going to restrict the scope of the data model rather than expand it, I'd want to understand:

What current users are going to be hurt by the change?
What remedies will they have available?
How would those users have been able to make their system work in the new system if starting from scratch and would they be satisfied with the result?

In particular, if we switch the data model to require String == Unicode (so that strings can't be arbitrary bytes), then IIUC this ends up insisting that map keys can only be Unicode. (The docs website isn't explicit about this in the data model section, but it is in the schema section, where it describes schemas as a superset of the data model.) That in turn a) starts to cause problems with code people have already written against the existing data model, and b) prevents us from representing non-Unicode map keys, which people occasionally have a use for and, IIUC, have already made use of.
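To illustrate the map-key concern, here is a hypothetical Python sketch. It assumes keys arrive as raw bytes, as they might from an implementation whose string type can hold arbitrary bytes, and separates the keys that would survive a Unicode-only rule from those that would not. The helper `split_keys` and the sample keys are invented for this example.

```python
def split_keys(keys):
    """Partition raw keys into valid-UTF-8 text keys and bytes-only keys."""
    text_keys, byte_keys = [], []
    for raw in keys:
        try:
            text_keys.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Not valid UTF-8: cannot be a key if String == Unicode.
            byte_keys.append(raw)
    return text_keys, byte_keys


text_keys, byte_keys = split_keys([b"name", b"\xc3\xa9", b"\xff\x00\x01"])
print(text_keys, byte_keys)
```

Anything landing in `byte_keys` is the data that existing users would need some remedy for, which is what the questions above are asking about.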

rvagg · 2022-04-22T03:23:49Z

Still serious about the good-first-contribution label (I was talking to someone about getting involved as I was writing that originally) mainly because it's one of these topics that has very high educational value; working through a PR and pulling together stakeholders, even if a PR doesn't actually land (!) would be a valuable exercise for someone looking to specifically upskill in IPLD. Caveat emptor of course, but there are some folks in Launchpad that may want to go deeper in IPLD that could benefit from engaging on this topic because it touches many areas of concern at the data layer of our stack.

Otherwise, this becomes a very low priority for us because we're already weary of this discussion and don't have the bandwidth to get a PR over the line at the moment anyway. It's less than a P3.

rvagg added the good-first-contribution label Apr 20, 2022

rvagg added the help wanted label Apr 20, 2022

BigLep added the P3 label May 10, 2022

vmx mentioned this issue Jun 28, 2022

spec: initial WAC spec #226

Open

aschmahmann mentioned this issue Sep 13, 2022

IPLD WASM tracking issue #236

Open
