String Kind as Unicode #203

Open
Stebalien opened this issue Apr 5, 2022 · 4 comments

Comments

@Stebalien
Contributor

The "String" kind should be defined as "text". Specifically, any text that can be mapped to Unicode.

  • It should NOT even mention encoding. String encoding is defined by the codec, not the IPLD data model.
  • It should NOT allow arbitrary byte "strings". We have a "Bytes" kind for that.

Strings are not for byte-like things. Strings are not (as much as Go would like them to be) "immutable byte strings". Strings are text. They don't have to be Unicode, but Unicode is designed to be the superset of all text-things anyone might want to write, so it's a good starting point.

It's really important to distinguish between the concept of bytes and the concept of text: the string "foo" is composed of a sequence of symbols f, o, o. The underlying encoding is up to the implementation.
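As a minimal sketch of that distinction (Python is used here only for illustration; the variable names are hypothetical), the same three symbols can be serialized under two different encodings and still decode to identical text:

```python
# The text "foo" is a sequence of symbols (Unicode code points).
text = "foo"
code_points = [ord(ch) for ch in text]

# Two encodings of the same text produce different byte sequences.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
assert utf8 != utf16

# Decoding either byte sequence with its own encoding recovers the same text.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == text
```

The byte sequences differ, but the text they stand for does not, which is the sense in which the encoding is an implementation detail.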

@rvagg
Member

rvagg commented Apr 20, 2022

Marking this as a good contribution for someone to attempt to make, although there are landmines all through this topic.

Doc here has the key content: https://github.com/ipld/ipld/blob/master/docs/data-model/kinds.md#string-kind

Some of the landmines are present in that doc, but I think the framing that @Stebalien is proposing here, focusing on "text" and then "Unicode", is a good one. We focus on UTF-8 in the current doc, because that's really the dominant binary encoding form of Unicode characters, but maybe, as @Stebalien is suggesting, that's getting far too into the codec-weeds when the topic is really about the data model, which isn't about encoding. I'm still fine mentioning UTF-8, simply because it's going to be the most common encoding format and is a useful reference point (e.g. CBOR specifically wants UTF-8 for major type 3, and that fact is at the heart of some of our recent dramas around our sloppy Go dag-cbor implementations that don't care what the conversion from a string to []byte does, so it does make it useful to talk about). But yeah, the focus of that doc should be what's going on in-memory and in the programmatic interfaces to the data model, not the encoded forms of the data.
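To make the CBOR point concrete, here is a rough sketch of how the two CBOR string types differ on the wire. It is hand-rolled for illustration and is not the API of any IPLD or CBOR library; the function names are hypothetical, and it only handles payloads shorter than 24 bytes, where the length fits in the initial byte.

```python
MAJOR_BYTES = 2  # CBOR major type 2: byte string
MAJOR_TEXT = 3   # CBOR major type 3: text string, payload must be UTF-8


def cbor_head(major: int, length: int) -> bytes:
    """Build the single initial byte for short items (length < 24)."""
    if not 0 <= length < 24:
        raise ValueError("this sketch only handles lengths below 24")
    return bytes([(major << 5) | length])


def encode_text(value: str) -> bytes:
    # Strict encoding: raises UnicodeEncodeError for text that has no
    # valid UTF-8 form (for example a lone surrogate), instead of
    # silently writing whatever bytes happen to be there.
    payload = value.encode("utf-8")
    return cbor_head(MAJOR_TEXT, len(payload)) + payload


def encode_bytes(value: bytes) -> bytes:
    # Byte strings carry arbitrary bytes; no validation is needed.
    return cbor_head(MAJOR_BYTES, len(value)) + value
```

Under these assumptions, `encode_text("foo")` yields the initial byte `0x63` followed by the three UTF-8 bytes, while `encode_bytes(b"\xff\xfe")` yields `0x42` followed by the raw bytes. The strictness lives entirely in the text path, which is the step a careless string-to-bytes conversion skips.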

@vmx any input on this?

@vmx
Member

vmx commented Apr 20, 2022

I'd be happy if strings in the IPLD Data Model were described as a list of Unicode characters. The data model really is something abstract. Even if e.g. your codec is using UTF-8, your implementation might still internally represent an IPLD Data Model String as UTF-16 (I'm thinking of e.g. Java here). But that really is not the concern of the data model.

I'd then probably add a sentence (somewhere, perhaps in a doc about codecs) that the exact encoding depends on the codec, and that in the case of DAG-CBOR, DAG-PB and DAG-JSON, a string is encoded as UTF-8 because those underlying formats demand it in their specifications.
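A small sketch of that separation, assuming a runtime that holds strings as UTF-16 code units and a codec that writes UTF-8 (the variable names are hypothetical, and Python's own string type is only standing in for the abstract text):

```python
# Abstract text: two Unicode characters, one of them outside the BMP.
text = "a\U0001F600"

# What a UTF-16 based runtime (e.g. Java) might hold in memory.
in_memory = text.encode("utf-16-le")
utf16_code_units = len(in_memory) // 2  # the non-BMP character needs a surrogate pair

# What a UTF-8 based codec would write for the same text.
on_wire = in_memory.decode("utf-16-le").encode("utf-8")

# The unit counts differ at every layer, but the text is the same.
assert on_wire.decode("utf-8") == text
```

Here the character count, the UTF-16 code unit count and the UTF-8 byte count are all different numbers for the same string, which is why a data-model definition phrased in terms of characters stays independent of both the runtime and the codec.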

@aschmahmann

aschmahmann commented Apr 20, 2022

@rvagg this doesn't seem like a good-first-contribution kind of thing, but rather something that requires some figuring out from people who have more context (unless the idea is just to replace every instance of UTF-8 with Unicode, but leave the general text around allowing for bytes).

Overall, if we're going to restrict the scope of the data model rather than expand it I'd want to understand:

  1. What current users are going to be hurt by the change?
  2. What remedies will they have available?
  3. How would those users have been able to make their system work in the new system if starting from scratch and would they be satisfied with the result?

In particular, if we switch the data model to require String == Unicode (so that strings can't be arbitrary bytes), then IIUC this ends up insisting that map keys can only be Unicode. (The docs website isn't explicit about this in the data model section, but it is in the schema section, where it describes schemas as a superset of the data model.) That then a) starts to cause problems with code people have already written against the existing data model, and b) prevents us from representing non-Unicode map keys, which people occasionally have a use for and IIUC have already made use of.
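To illustrate the map-key concern, here is a hypothetical check of the kind a Unicode-only String definition would imply. It assumes keys arrive from a codec as raw bytes and must be valid UTF-8 to count as text; the function name and the sample keys are made up for the example.

```python
def is_unicode_key(raw: bytes) -> bool:
    """Return True if the raw key bytes decode as valid UTF-8 text."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample keys: two valid UTF-8 strings and one arbitrary byte sequence.
keys = [b"name", "caf\u00e9".encode("utf-8"), b"\xff\x00\x01"]
rejected = [k for k in keys if not is_unicode_key(k)]
```

With these inputs only the last key ends up in `rejected`. Existing data that relies on such keys is the case the three questions above are asking about.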

@rvagg
Member

rvagg commented Apr 22, 2022

Still serious about the good-first-contribution label (I was talking to someone about getting involved as I was writing that originally) mainly because it's one of these topics that has very high educational value; working through a PR and pulling together stakeholders, even if a PR doesn't actually land (!) would be a valuable exercise for someone looking to specifically upskill in IPLD. Caveat emptor of course, but there are some folks in Launchpad that may want to go deeper in IPLD that could benefit from engaging on this topic because it touches many areas of concern at the data layer of our stack.

Otherwise, this becomes a very low priority for us because we're already weary of this discussion and don't have the bandwidth to get a PR over the line at the moment anyway. It's less than a P3.


5 participants