Editorial: revamp the way we deal with code points and bytes #247

annevk · 2020-11-02T17:56:16Z

This is WIP, mainly since I'm a little unsure we want to go this far, but I also kinda like it.

@andreubotella @ricea @domenic @aphillips thoughts?


domenic · 2020-11-02T18:07:29Z

Seems reasonable to me.

ricea · 2020-11-02T18:16:28Z

Am I understanding correctly that the purpose is to disambiguate code-point and byte conversions?

If so, my concern is that the extra rigor creates extra opportunities for errors, and may not be pulling its weight.

However, if you prefer it, that's good enough for me.

aphillips

I like this generally as a direction, but made some "food for thought" comments below.

aphillips · 2020-11-02T19:11:01Z

encoding.bs

@@ -1915,7 +1916,7 @@ constructor steps are:
 <p class=note>{{DOMString}}, as well as an <a for=/>I/O queue</a> of code units rather than scalar
 values, are used here so that a surrogate pair that is split between chunks can be reassembled into
 the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular,
- lone surrogates will be replaced with U+FFFD.
+ lone surrogates will be replaced with U+FFFD (�).


In Charmod we often followed the convention:

� [U+FFFD REPLACEMENT CHARACTER]

(with the [U+xxxx character name] part styled distinctly). I say "often" because I willfully ignored the convention whenever it reduced clarity, particularly with long sequences used in this or that example. For examples like this, you might consider something similar, since it makes the text unambiguous?

OTOH, I find this pretty clear and am not sure that the charmod style adds that much. I like quoting the character like this when it's printable.

We made up our own convention in https://infra.spec.whatwg.org/#code-points since we found the one in Charmod a bit too verbose, iirc.
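
For readers following the note quoted in the diff above, here is a minimal Python sketch of the behavior it describes: a lead surrogate at the end of one chunk is held back so it can pair with a trail surrogate at the start of the next, and any lone surrogate becomes U+FFFD. This is an illustration only, not the standard's algorithm text; `scalar_values_from_chunks` is a hypothetical name, and chunks are assumed to be lists of 16-bit code units.

```python
from typing import Iterable, Iterator, List, Optional

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def scalar_values_from_chunks(chunks: Iterable[List[int]]) -> Iterator[int]:
    """Yield scalar values from chunks of UTF-16 code units (hypothetical helper)."""
    pending_lead: Optional[int] = None  # lead surrogate waiting for its trail
    for chunk in chunks:
        for unit in chunk:
            if pending_lead is not None:
                lead, pending_lead = pending_lead, None
                if 0xDC00 <= unit <= 0xDFFF:
                    # Lead + trail form one scalar value, even across chunks.
                    yield 0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00)
                    continue
                # The held lead surrogate was not followed by a trail: lone.
                yield REPLACEMENT
            if 0xD800 <= unit <= 0xDBFF:
                pending_lead = unit  # wait for the next code unit
            elif 0xDC00 <= unit <= 0xDFFF:
                yield REPLACEMENT  # trail surrogate with no preceding lead
            else:
                yield unit
    if pending_lead is not None:
        # Input ended while a lead surrogate was still pending.
        yield REPLACEMENT
```

Feeding it `[[0x48, 0xD83D], [0xDE00, 0x21]]` shows the split pair being reassembled into a single scalar value rather than two replacement characters.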

aphillips · 2020-11-02T19:17:03Z

encoding.bs


- <li><p>If <var>byte</var> is an <a>ASCII byte</a>, return
- a code point whose value is <var>byte</var>.
+ <li><p>Let <var>byteValue</var> be <var>byte</var>'s <a for=byte>value</a>.


is byteValue really needed vs. just saying things like:

If byte is an ASCII byte, then return a code point whose value is byte's value.

I realize that "code point's value" is a different integer type than "byte's value", but we mean the number in any case.
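
To make the question concrete, here is a minimal Python sketch of the ASCII branch written without an intermediate variable. The `Byte` and `CodePoint` classes are hypothetical stand-ins for the Infra types, introduced only to show that the two are distinct kinds of thing that happen to carry the same number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Byte:
    """Hypothetical stand-in for a byte: wraps a number in 0x00..0xFF."""
    value: int

    def __post_init__(self) -> None:
        if not 0x00 <= self.value <= 0xFF:
            raise ValueError("byte value out of range")


@dataclass(frozen=True)
class CodePoint:
    """Hypothetical stand-in for a code point: wraps a number in 0..0x10FFFF."""
    value: int

    def __post_init__(self) -> None:
        if not 0 <= self.value <= 0x10FFFF:
            raise ValueError("code point value out of range")


def is_ascii_byte(byte: Byte) -> bool:
    # An ASCII byte is a byte in the range 0x00 to 0x7F, inclusive.
    return byte.value <= 0x7F


def ascii_byte_to_code_point(byte: Byte) -> CodePoint:
    # "Return a code point whose value is byte's value": the value is
    # read inline, with no separate byteValue variable.
    if not is_ascii_byte(byte):
        raise ValueError("not an ASCII byte")
    return CodePoint(byte.value)
```

With these wrappers, `Byte(0x41)` and `CodePoint(0x41)` do not compare equal even though both hold 0x41, which is the distinction the new wording is trying to keep visible.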

aphillips · 2020-11-02T19:26:30Z

encoding.bs

+ <a for="code point">value</a> is <var>byteValue</var>.
+
+ <li><p>Return a <a>code point</a> whose <a for="code point">value</a> is
+ 0xF780 + <var>byteValue</var> &minus; 0x80.


I see the problem. You don't want prose here. But can't we just say 0xF780 + byte - 0x80?

Is there a reason I'm not seeing for why we don't just make the number 0xF700? Is the reason to emphasize that we're trying to get to/from bytes >= 0x80?

We've had some cases where we want to distinguish bytes from numbers. So the question is whether we want to do that here as well. And I guess in some sense we do since we want to return code points or bytes, but a lot of the calculations are on numbers.

I think we could use byte in the calculation directly (as we already did), but it wouldn't really be logically consistent with how we talk about bytes and numbers elsewhere in the web platform.

(I guess another way would be to say that in equations they are cast to their value.)

We could define implicit conversions code point → number and byte → number (whatwg/infra#319) and perhaps the other way around too. But even if we don't, we could use short algorithmic phrases inside the formula: "0xF780 + (byte's value) − 0x80".

There are other formulas in the standard that use byte or code point values directly, though, and they should be changed accordingly. (Interestingly, there are formulas dealing with code units around TextEncoder and TextEncoderStream, which don't have this problem because code units seem to be defined directly as a number type.)

FWIW I intuitively like making the code point <—> byte/number conversions explicit, and don't see as much of a need for distinguishing bytes and numbers. (I'd be OK defining bytes as a subtype of numbers, if we ever make progress on defining numbers.)
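
On the 0xF700 question: the two spellings produce the same number for every byte in 0x80 to 0xFF, so the choice is presentational. Below is a minimal Python sketch of the step under discussion, assuming bytes and code points are modeled as plain integers (which deliberately sidesteps the byte-versus-number distinction debated above). `decode_byte` and `formulas_agree` are hypothetical names, not taken from the standard.

```python
def decode_byte(byte: int) -> int:
    """Map one byte value (0x00..0xFF) to a code point value."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x7F:
        # ASCII byte: the code point has the same value.
        return byte
    # Non-ASCII byte: shift into the range U+F780 to U+F7FF.
    return 0xF780 + byte - 0x80


def formulas_agree() -> bool:
    # 0xF780 + b - 0x80 simplifies to 0xF700 + b for every non-ASCII byte.
    return all(0xF780 + b - 0x80 == 0xF700 + b for b in range(0x80, 0x100))
```

Writing the formula as `0xF780 + byte - 0x80` keeps both range endpoints (first target code point, first non-ASCII byte) visible, at the cost of an extra term.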

aphillips · 2020-11-02T19:27:41Z

encoding.bs

- <li><p>If <var>code point</var> is in the range U+F780 to U+F7FF, inclusive, return
- a byte whose value is <var>code point</var> &minus; 0xF780 + 0x80.
+ <li><p>If <var>codePointValue</var> is in the range 0xF780 to 0xF7FF, inclusive, then return a
+ <a>byte</a> whose <a for=byte>value</a> is <var>codePointValue</var> &minus; 0xF780 + 0x80.
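
The encoder-side step in this hunk is the inverse mapping. A minimal Python sketch, again assuming plain integers stand in for code point and byte values; `encode_code_point_value` is a hypothetical name, and returning `None` here stands in for whatever error handling the surrounding algorithm performs.

```python
from typing import Optional


def encode_code_point_value(code_point_value: int) -> Optional[int]:
    """Map a code point value to a byte value, or None if it has no mapping."""
    if 0x00 <= code_point_value <= 0x7F:
        # ASCII code point: the byte has the same value.
        return code_point_value
    if 0xF780 <= code_point_value <= 0xF7FF:
        # Inverse of the decoder formula: lands in 0x80..0xFF.
        return code_point_value - 0xF780 + 0x80
    return None  # assumption: the caller treats this as an error
```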


annevk added 2 commits November 2, 2020 18:55

Editorial: revamp the way we deal with code points and bytes

fe18a76

This is WIP, mainly since I'm a little unsure we want to go this far, but I also kinda like it.

sketch nit (getting this all correct might be the trickiest bit of th…

a92bda8

…is exercise)

aphillips reviewed Nov 2, 2020


Base automatically changed from master to main January 15, 2021 07:36


annevk commented Nov 2, 2020 •

edited by pr-preview bot

domenic commented Nov 2, 2020

ricea commented Nov 2, 2020

aphillips left a comment

aphillips Nov 2, 2020

annevk Nov 3, 2020

aphillips Nov 2, 2020

aphillips Nov 2, 2020

annevk Nov 3, 2020 •

edited

andreubotella Nov 3, 2020

domenic Nov 3, 2020

aphillips Nov 2, 2020
