Editorial: refer to code points directly by name/number instead of using aliases #3310

michaelficarra · 2024-04-04T19:52:12Z

gibson042 · 2024-04-10T17:30:15Z

spec.html

@@ -588,7 +588,7 @@ <h1>Terminal Symbols</h1>
        <p>In contrast, in the syntactic grammar, a contiguous run of fixed-width code points is a single terminal symbol.</p>
        <p>Terminal symbols come in two other forms:</p>
        <ul>
-          <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;ABBREV>" where "ABBREV" is a mnemonic for the code point or set of code points. These forms are defined in <emu-xref href="#sec-unicode-format-control-characters" title></emu-xref>, <emu-xref href="#sec-white-space" title></emu-xref>, and <emu-xref href="#sec-line-terminators" title></emu-xref>.</li>
+          <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>


This seems gratuitously divergent from Unicode conventions. Should we instead try to align?

michaelficarra · 2024-04-10

Are you saying we should use small caps? As for the name, I chose to use one of the official aliases when I felt it was more appropriate/descriptive. I can explicitly state that it is the code point name or an alias if you prefer.

gibson042 · 2024-04-11 (edited)

Yes, I think we should use small caps and avoid brackets except for sequences, e.g.

Suggested change

- <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+ <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 <small class="code-point-name">NULL</small>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

or maybe ecmarkup support

Suggested change

- <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+ <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<code data-char-name="NULL">U+0000</code>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

or even ecmarkdown

Suggested change

- <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+ <li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 ^^NULL^^" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

michaelficarra · 2024-05-16

I think it's fine to defer the small-caps names (with possible tooling support) to a follow-up.

gibson042 · 2024-05-17

Yeah, I think so. 👍
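For readers following the thread, here is a minimal sketch of the bracketed form from the diff, "<U+0000 (NULL)>", with 4 to 6 hexits. The helper name `describeCodePoint` is hypothetical and not part of the spec or ecmarkup. The character name is passed in by the caller because JavaScript has no built-in lookup for Unicode names.

```javascript
// Hypothetical helper (not from the spec or any tooling): formats a code
// point in the "<U+XXXX (NAME)>" form discussed in this thread.
function describeCodePoint(codePoint, name) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  // Pad to at least 4 hexits; larger code points naturally use 5 or 6.
  const hexits = codePoint.toString(16).toUpperCase().padStart(4, "0");
  return `<U+${hexits} (${name})>`;
}

console.log(describeCodePoint(0x0000, "NULL"));          // <U+0000 (NULL)>
console.log(describeCodePoint(0x2028, "LINE SEPARATOR")); // <U+2028 (LINE SEPARATOR)>
console.log(describeCodePoint(0xe0001, "LANGUAGE TAG"));  // <U+E0001 (LANGUAGE TAG)>
```

The small-caps variants suggested above would only change the template string, not the hexit logic.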

gibson042 · 2024-04-22T18:37:36Z

spec.html

    </emu-grammar>
+    <emu-note>
+      <p>Other than for some of the code points listed as explicit alternatives in |WhiteSpace|, |WhiteSpace| intentionally excludes <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BWhite_Space%7D%26%5Cp%7BGeneral_Category%21%3DSpace_Separator%7D%5D">all code points that have the Unicode “White_Space” property but which are not classified in general category “Space_Separator” (“Zs”)</a>.</p>


#3303 (comment)

The link is good, but I still think an explicit mention of U+0085 (NEXT LINE) and probably also U+FEFF (ZERO WIDTH NO-BREAK SPACE) would be better. As observed in tc39/proposal-regexp-v-flag#37, the classification of these two code points is easy to overlook, and IMO it behooves the spec to highlight that.

Note also that a) https://util.unicode.org/UnicodeJsps is frequently unavailable, and in the recent past was offline for months, and b) even when it is available, there is no obvious indication that 7 of the 8 code points are included in ECMA-262 |WhiteSpace| or |LineTerminator| (and are thus matched by regular expression pattern \s, which exactly covers the union of |WhiteSpace| and |LineTerminator|) while the one in the middle, U+0085, is not.

sample output

Basic Latin — C0 controls
items: 5

U+0009 CHARACTER TABULATION; HORIZONTAL TABULATION; HT; TAB

U+000A END OF LINE; EOL; LF; LINE FEED; NEW LINE; NL

U+000B LINE TABULATION; VERTICAL TABULATION; VT

U+000C FF; FORM FEED

U+000D CARRIAGE RETURN; CR

Latin 1 Supplement — C1 controls
items: 1

U+0085 NEL; NEXT LINE

General Punctuation — Separators
items: 2

  U+2028 LINE SEPARATOR

  U+2029 PARAGRAPH SEPARATOR
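The classification described above can be checked directly in a conforming engine. The following is only a sketch: it tests the eight code points from the sample output, plus U+FEFF, against `\s`. The names `candidates` and `matchedByWhitespaceEscape` are illustrative.

```javascript
// The eight code points from the sample output above, plus U+FEFF, which
// |WhiteSpace| lists as an explicit alternative.
const candidates = [
  [0x0009, "CHARACTER TABULATION"],
  [0x000a, "LINE FEED"],
  [0x000b, "LINE TABULATION"],
  [0x000c, "FORM FEED"],
  [0x000d, "CARRIAGE RETURN"],
  [0x0085, "NEXT LINE"],
  [0x2028, "LINE SEPARATOR"],
  [0x2029, "PARAGRAPH SEPARATOR"],
  [0xfeff, "ZERO WIDTH NO-BREAK SPACE"],
];

// True when the single code point is matched by the \s character class escape.
function matchedByWhitespaceEscape(codePoint) {
  return /\s/u.test(String.fromCodePoint(codePoint));
}

// Collect the names of the candidates that \s does not match.
const notMatched = candidates
  .filter(([codePoint]) => !matchedByWhitespaceEscape(codePoint))
  .map(([, name]) => name);

console.log(notMatched); // [ 'NEXT LINE' ]
```

Only U+0085 falls outside `\s`, which is the easy-to-overlook case the comment asks the note to call out.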

spec.html

addressed

jmdyck

Everything else LGTM.

jmdyck · 2024-05-23T16:07:05Z

spec.html

-              </th>
-              <th>
-                Code Unit Value
+                |SingleEscapeCharacter|


Note that a |SingleEscapeCharacter| is a single character and does not include the preceding backslash. So it's a mismatch to have |SingleEscapeCharacter| as the column head and then (e.g.) \b below it. (The status quo uses "Escape Sequence" as the column head, which is not a defined term. You'd have to go up to |DoubleStringCharacter| and |SingleStringCharacter| to get a nonterminal that actually includes the backslash.)

The simplest fix would be to delete the backslashes from the data cells (as in Table 61: ControlEscape Code Point Values), although that loses the visual cue that they're 'escape sequences'.

Alternatively, you could insert a backslash into the header cell, but that's a bit dodgy, since `\` |SingleEscapeCharacter| doesn't occur in the grammar, and the prose associated with |SingleEscapeCharacter| wouldn't be quite right.

michaelficarra · 2024-05-23

I'll just remove the backslash.

michaelficarra · 2024-05-23

Actually, I think I'll replace it with a code point descriptor.

jmdyck · 2024-05-23

Well, that's valid, but I'm not sure it's an improvement (over just removing the backslashes). The definition of |SingleEscapeCharacter| is one of ' " \ b f n r t v, so it seems like the natural approach would be to use those characters rather than code point descriptors.

michaelficarra · 2024-05-23

I'm fine with changing to just the single character if the other editors prefer.
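As a rough illustration of the table being discussed, the sketch below pairs each |SingleEscapeCharacter| (without the backslash) with the code point its escape sequence produces in a string literal. The names `singleEscapes` and `codePointLabel` are made up for this example and do not come from the spec.

```javascript
// Each entry pairs a SingleEscapeCharacter with a string literal that uses
// the corresponding escape sequence.
const singleEscapes = [
  ["'", "\'"],
  ['"', "\""],
  ["\\", "\\"],
  ["b", "\b"],
  ["f", "\f"],
  ["n", "\n"],
  ["r", "\r"],
  ["t", "\t"],
  ["v", "\v"],
];

// Formats the first code point of a string as U+XXXX.
function codePointLabel(value) {
  return "U+" + value.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Map from the bare character to the code point of its escape sequence.
const table = new Map(
  singleEscapes.map(([character, value]) => [character, codePointLabel(value)])
);

for (const [character, label] of table) {
  console.log(`${character}\t${label}`);
}
```

Keying the map on the bare character mirrors the option of dropping the backslash from the data cells.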

addressed

michaelficarra force-pushed the GH-2930 branch 4 times, most recently from 54acd44 to ce3e176, on April 4, 2024 19:59

michaelficarra marked this pull request as ready for review April 6, 2024 02:44

jmdyck previously requested changes Apr 6, 2024


gibson042 reviewed Apr 10, 2024


michaelficarra added and then removed the editor call label (to be discussed in the next editor call) Apr 17, 2024

gibson042 reviewed Apr 22, 2024

michaelficarra force-pushed the GH-2930 branch from 89a747a to e1c3634 on May 16, 2024 22:54

michaelficarra added the editorial change and editor call (to be discussed in the next editor call) labels May 16, 2024

jmdyck reviewed May 16, 2024


jmdyck reviewed May 17, 2024


michaelficarra mentioned this pull request May 17, 2024

new notation for Unicode code points in ES grammar es-meta/esmeta#220

Closed

michaelficarra added 13 commits May 22, 2024 15:05

9137b6f WIP
0660251 add a note about notation back
85c7abf more consistent notation
80989f2 hexits
b9365c3 revert note change
5d70c9e feedback
e15a4de un-revert the note
4104527 more <(LF|CR|LS|PS)>
cb8bd8c revert ASCIISign change
0a64ef7 fix formatting
a5c63f0 everyone always forgets about Annex A
eb07d91 typo
93c25fc still typo

michaelficarra force-pushed the GH-2930 branch from 30dd718 to 93c25fc on May 22, 2024 22:05

michaelficarra removed the editor call label (to be discussed in the next editor call) May 22, 2024

michaelficarra requested a review from jmdyck May 23, 2024 02:13

jmdyck previously requested changes May 23, 2024

525c127 fix Single Character Escape Sequences table

