Editorial: refer to code points directly by name/number instead of using aliases #3310

Open
wants to merge 14 commits into
base: main
from

michaelficarra
Member

Fixes #2930.

@michaelficarra michaelficarra marked this pull request as ready for review April 6, 2024 02:44
jmdyck
jmdyck previously requested changes Apr 6, 2024
spec.html
@@ -588,7 +588,7 @@ <h1>Terminal Symbols</h1>
<p>In contrast, in the syntactic grammar, a contiguous run of fixed-width code points is a single terminal symbol.</p>
<p>Terminal symbols come in two other forms:</p>
<ul>
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;ABBREV>" where "ABBREV" is a mnemonic for the code point or set of code points. These forms are defined in <emu-xref href="#sec-unicode-format-control-characters" title></emu-xref>, <emu-xref href="#sec-white-space" title></emu-xref>, and <emu-xref href="#sec-line-terminators" title></emu-xref>.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
Contributor

This seems gratuitously divergent from Unicode conventions. Should we instead try to align?

Member Author

Are you saying we should use small caps? As for the name, I chose to use one of the official aliases when I felt it was more appropriate/descriptive. I can explicitly state that it is the code point name or an alias if you prefer.

Contributor

@gibson042 gibson042 Apr 11, 2024

Yes, I think we should use small caps and avoid brackets except for sequences, e.g.

Suggested change
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 <small class="code-point-name">NULL</small>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

or maybe ecmarkup support

Suggested change
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "<code data-char-name="NULL">U+0000</code>" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

or even ecmarkdown

Suggested change
-<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "&lt;U+0000 (NULL)>" where `0000` is 4 to 6 hexits representing the code point in hexadecimal notation and `NULL` is the code point name.</li>
+<li>In the lexical and RegExp grammars, Unicode code points without a conventional printed representation are instead shown in the form "U+0000 ^^NULL^^" where `0000` is 4 to 6 hexadecimal digits representing the code point and `NULL` is the code point name.</li>

Member Author

I think it's fine to defer the small-caps names (with possible tooling support) to a follow-up.

Contributor

Yeah, I think so. 👍
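
For readers comparing the two notations debated in this thread, here is a minimal TypeScript sketch that renders a code point either in the bracketed form from the PR ("<U+0000 (NULL)>") or in the unbracketed form suggested in review ("U+0000 NULL", without the small-caps styling). The function name `formatCodePoint` and its parameters are hypothetical, chosen only for illustration; nothing here is part of the spec, ecmarkup, or ecmarkdown.

```typescript
// Hypothetical helper for illustration only; not part of ECMA-262 or ecmarkup.
// Renders a code point as "U+" followed by 4 to 6 uppercase hexadecimal digits
// and the supplied name, in either of the two notations discussed above.
function formatCodePoint(codePoint: number, name: string, brackets: boolean = true): string {
  if (Math.floor(codePoint) !== codePoint || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  const hex = codePoint.toString(16).toUpperCase();
  // Pad to at least 4 digits; values above U+FFFF naturally use 5 or 6 digits.
  const padded = hex.length < 4 ? "0000".slice(hex.length) + hex : hex;
  const label = "U+" + padded;
  return brackets ? "<" + label + " (" + name + ")>" : label + " " + name;
}
```

Calling `formatCodePoint(0, "NULL")` gives the bracketed form from the PR, and passing `false` as the third argument gives the plain "U+XXXX NAME" form. The caller supplies the name, so choosing between the formal name and an alias stays an editorial decision.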

@michaelficarra michaelficarra added and then removed the "editor call" label (to be discussed in the next editor call) Apr 17, 2024
</emu-grammar>
<emu-note>
<p>Other than for some of the code points listed as explicit alternatives in |WhiteSpace|, |WhiteSpace| intentionally excludes <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7BWhite_Space%7D%26%5Cp%7BGeneral_Category%21%3DSpace_Separator%7D%5D">all code points that have the Unicode “White_Space” property but which are not classified in general category “Space_Separator” (“Zs”)</a>.</p>
Contributor

#3303 (comment)

The link is good, but I still think an explicit mention of U+0085 (NEXT LINE) and probably also U+FEFF (ZERO WIDTH NO-BREAK SPACE) would be better. As observed in tc39/proposal-regexp-v-flag#37, the classification of these two code points is easy to overlook, and IMO it behooves the spec to highlight that.

Note also that a) https://util.unicode.org/UnicodeJsps is frequently unavailable, and in the recent past was offline for months, and b) even when it is available, there is no obvious indication that 7 of the 8 code points are included in ECMA-262 |WhiteSpace| or |LineTerminator| (and are thus matched by the regular expression pattern \s, which exactly covers the union of |WhiteSpace| and |LineTerminator|) and 1 in the middle is not.

sample output

Basic Latin: C0 controls
items: 5

  U+0009  CHARACTER TABULATION; HORIZONTAL TABULATION; HT; TAB
  U+000A  END OF LINE; EOL; LF; LINE FEED; NEW LINE; NL
  U+000B  LINE TABULATION; VERTICAL TABULATION; VT
  U+000C  FF; FORM FEED
  U+000D  CARRIAGE RETURN; CR

Latin 1 Supplement: C1 controls
items: 1

  U+0085  NEL; NEXT LINE

General Punctuation: Separators
items: 2

  U+2028  LINE SEPARATOR
  U+2029  PARAGRAPH SEPARATOR
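
The classification described in this comment can be checked against an engine's `\s` behaviour. The following TypeScript sketch tests the eight code points from the sample output, plus U+FEFF, against `/\s/`. The names `candidates`, `matchesWhitespaceEscape`, and `excluded` are illustrative assumptions, and each label is one of the names shown in the listing above.

```typescript
// The eight code points that have White_Space but are not in general
// category Zs, as listed in the sample output above.
const candidates: Array<[number, string]> = [
  [0x0009, "CHARACTER TABULATION"],
  [0x000a, "LINE FEED"],
  [0x000b, "LINE TABULATION"],
  [0x000c, "FORM FEED"],
  [0x000d, "CARRIAGE RETURN"],
  [0x0085, "NEXT LINE"],
  [0x2028, "LINE SEPARATOR"],
  [0x2029, "PARAGRAPH SEPARATOR"],
];

// Illustrative helper: true if the single code unit is matched by \s, which
// covers the union of |WhiteSpace| and |LineTerminator|.
function matchesWhitespaceEscape(codePoint: number): boolean {
  return /^\s$/.test(String.fromCharCode(codePoint));
}

// Code points from the list that \s does NOT match.
const excluded: number[] = candidates
  .filter(([cp]) => !matchesWhitespaceEscape(cp))
  .map(([cp]) => cp);
```

In a conforming engine, `excluded` contains only 0x0085 (NEXT LINE), while `matchesWhitespaceEscape(0xfeff)` is true even though U+FEFF does not appear in the listing. These are the two cases the comment asks the spec to call out explicitly.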

@michaelficarra michaelficarra removed the "editor call" label (to be discussed in the next editor call) May 22, 2024
jmdyck
jmdyck previously requested changes May 23, 2024
Collaborator

@jmdyck jmdyck left a comment

Everything else LGTM.

spec.html
</th>
<th>
Code Unit Value
|SingleEscapeCharacter|
Collaborator

Note that a |SingleEscapeCharacter| is a single character and does not include the preceding backslash. So it's a mismatch to have |SingleEscapeCharacter| as the column head and then (e.g.) \b below it. (The status quo uses "Escape Sequence" as the column head, which is not a defined term. You'd have to go up to |DoubleStringCharacter| and |SingleStringCharacter| to get a nonterminal that actually includes the backslash.)

The simplest fix would be to delete the backslashes from the data cells (as in Table 61: ControlEscape Code Point Values), although that loses the visual cue that they're 'escape sequences'.

Alternatively, you could insert a backslash into the header cell, but that's a bit dodgy, since:

  • `\` |SingleEscapeCharacter| doesn't occur in the grammar, and
  • the prose associated with |SingleEscapeCharacter| wouldn't be quite right.

Member Author

I'll just remove the backslash.

Member Author

Actually, I think I'll replace it with a code point descriptor.

Collaborator

Well, that's valid, but I'm not sure it's an improvement (over just removing the backslashes). The definition of SingleEscapeCharacter is one of ' " \ b f n r t v, so it seems like the natural approach would be to use those characters rather than code point descriptors.

Member Author

I'm fine with changing to just the single character if the other editors prefer.
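
To make the distinction in this thread concrete: a |SingleEscapeCharacter| is the single character after the backslash, and the table maps it to a code unit value. The TypeScript sketch below keys a table by the bare character (no backslash) and checks each value against the corresponding escape in a string literal. The names `singleEscapeValues`, `escapedLiterals`, and `mismatches` are illustrative assumptions. The code unit values are those of the standard single-character string escapes, not text taken from this PR.

```typescript
// Keys are the bare |SingleEscapeCharacter| (no leading backslash):
// ' " \ b f n r t v. Values are the code unit each escape produces.
const singleEscapeValues: Record<string, number> = {
  "b": 0x0008,  // BACKSPACE
  "t": 0x0009,  // CHARACTER TABULATION
  "n": 0x000a,  // LINE FEED
  "v": 0x000b,  // LINE TABULATION
  "f": 0x000c,  // FORM FEED
  "r": 0x000d,  // CARRIAGE RETURN
  "\"": 0x0022, // QUOTATION MARK
  "'": 0x0027,  // APOSTROPHE
  "\\": 0x005c, // REVERSE SOLIDUS (the key is the one-character string "\")
};

// The same characters written as full escape sequences in string literals.
const escapedLiterals: Record<string, string> = {
  "b": "\b", "t": "\t", "n": "\n", "v": "\v", "f": "\f", "r": "\r",
  "\"": "\"", "'": "\'", "\\": "\\",
};

// Any key whose escape sequence does not evaluate to the tabulated code unit.
const mismatches: string[] = Object.keys(singleEscapeValues).filter(
  (ch) => escapedLiterals[ch].charCodeAt(0) !== singleEscapeValues[ch]
);
```

Here `mismatches` should be empty. Keying the table by the bare character mirrors the "just remove the backslash" option. With a column headed |SingleEscapeCharacter|, the backslash appears only when the escape is actually written in a literal.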

Successfully merging this pull request may close these issues.

allow code points to be used directly in grammar without indirection