(Draft) Better Unicode Support #477

levbishop · 2023-06-21T15:27:26Z

Summary

Fixes #422
Fixes #440
Replaces #428

This tries to bring the spec and grammar more into compliance with the latest proposed UAX31 and draft UTS55

Details and comments

ANTLR doesn't have great support for set intersection/difference operations on unicode character sets so I had to duplicate some fragments pulling the definitions from PropList.txt but these shouldn't ever be changing.

I'm still thinking about how best to handle the rules for ignorable white space.

jlapeyre · 2023-07-21T15:08:24Z

@levbishop , I see above "replaces #422". Shouldn't this instead be "replaces #428"? If so, I'll close #428.

jakelishman

I know I haven't been in meetings any time recently, so I might not have all the context, but since I was asked by Lev, here's a couple of comments on the text.

In terms of code of the ANTLR-based parser, if I were reviewing something for a product that I owned, I'd complain about a lack of testing here, and that the normalisation hasn't actually been implemented. I don't necessarily think that that should actually stand in the way of the spec change, though, given how complex it is.

jakelishman · 2023-07-21T15:13:27Z

source/language/types.rst

+      | :math:`\pi`                   | pi     | - µ U+00B5 MICRO SIGN            | 3.1415926535...     |
+      |                               |        | - μ U+03BC GREEK SMALL LETTER MU |                     |


Those are some pretty unusual representations of pi.

oops i'll fix

jakelishman · 2023-07-21T15:22:17Z

source/grammar/qasm3Lexer.g4

-    : '"' ~["\r\t\n]+? '"'
-    | '\'' ~['\r\t\n]+? '\''
+    : '"' NotUAX31NewlineDoubleQuote+? '"'
+    | '\'' NotUAX31NewlineSingleQuote+? '\''


Previously, \r and \t were also forbidden in string literals, which this PR now allows. I think that's totally fine, especially since the spec at no point actually even defines a string literal (as opposed to a "bitstring literal", except for a somewhat implicit definition in the "included files" bit), I just wanted to mention it.

String literals supporting a string data type will be essentially impossible to add later if desired (I think likely) anyway. Because of bitstring literals. EDIT: Well, maybe some kind of context or constructor would do.

jakelishman · 2023-07-21T15:27:09Z

source/language/types.rst

+A precise statement of the Unicode compatibility is:
+
+- `UAX31-C1 <https://www.unicode.org/reports/tr31/tr31-37.html#C1>`_: The OpenQASM language conforms to version 37 of the Unicode® Standard Annex #⁠31
+- `UAX31-C2 <https://www.unicode.org/reports/tr31/tr31-37.html#C2>`_: It observes the following requirements:
+   - `UAX31-R1-2 <https://www.unicode.org/reports/tr31/tr31-37.html#R1-2>`_: Default Identifiers: To determine whether a string is an identifier it uses `UAX31-D1 <https://www.unicode.org/reports/tr31/tr31-37.html#D1>`_ with the following profile:
+      - ``Start := [[:XID_Start:]_]``
+      - ``Continue := [:XID_Continue:]``
+      - ``Medial := []``
+   - `UAX31-R1b <https://www.unicode.org/reports/tr31/tr31-37.html#R1b>`_ Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions.
+   - `UAX31-R4 <https://www.unicode.org/reports/tr31/tr31-37.html#R4>`_. Equivalent Normalized Identifiers using normalization form C (NFC).
+
+Additionally, to avoid line-break spoofing, we comply with the proposed
+
+- `UAX31-R3a-1`. Use ``Pattern_White_Space`` characters as all and only those the set of characters interpreted as whitespace in parsing., as follows:
+   - A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line:
+      -  ``U+000A`` (line feed)
+      -  ``U+000B`` (vertical tabulation)
+      -  ``U+000C`` (form feed)
+      -  ``U+000D`` (carriage return)
+      -  ``U+0085`` (next line)
+      -  ``U+2028`` LINE SEPARATOR
+      -  ``U+2029`` PARAGRAPH SEPARATOR
+   - The ``Pattern_White_Space`` characters with the property ``Default_Ignorable_Code_Point`` shall be treated as ignorable format controls
+   - All other characters in ``Pattern_White_Space`` shall be interpreted as horizontal space.


This generally seems sensible from an abstract perspective. Is it worth mentioning / calling out that we don't actually expect any near-term implementers to have fully implemented all the Unicode normalisation?

Also, this text technically permits the sequence U+OOOA U+OOOA to represented 1, 2 or really any natural number of "ends of line". I don't know how much you care about trying to define something unambiguous that's sensible here, though, since even the old \n / \r\n / \r split isn't the easiest thing in the world to define.

levbishop · 2023-07-27T11:29:09Z

@levbishop , I see above "replaces #422". Shouldn't this instead be "replaces #428"? If so, I'll close #428.

Yes. I meant "closes #422 and replaces #428"

levbishop added 2 commits June 7, 2023 11:44

Unicode update

2965107

Fix grammar and test

1082e3d

blakejohnson mentioned this pull request Jul 19, 2023

In scan for TimeUnit fragment allow two unicode representations of mu #428

Closed

jakelishman reviewed Jul 21, 2023

View reviewed changes

jlapeyre added Reference Parser ANLTR-based Python reference parser clarify specification Change text of spec, not (mainly) semantics of spec labels Jan 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(Draft) Better Unicode Support #477

(Draft) Better Unicode Support #477

levbishop commented Jun 21, 2023 •

edited

jlapeyre commented Jul 21, 2023 •

edited

jakelishman left a comment

jakelishman Jul 21, 2023

levbishop Jul 27, 2023

jakelishman Jul 21, 2023

jlapeyre Jul 21, 2023 •

edited

jakelishman Jul 21, 2023

levbishop commented Jul 27, 2023

		\| :math:`\pi` \| pi \| - µ U+00B5 MICRO SIGN \| 3.1415926535... \|
		\| \| \| - μ U+03BC GREEK SMALL LETTER MU \| \|

(Draft) Better Unicode Support #477

Are you sure you want to change the base?

(Draft) Better Unicode Support #477

Conversation

levbishop commented Jun 21, 2023 • edited

Summary

Details and comments

jlapeyre commented Jul 21, 2023 • edited

jakelishman left a comment

Choose a reason for hiding this comment

jakelishman Jul 21, 2023

Choose a reason for hiding this comment

levbishop Jul 27, 2023

Choose a reason for hiding this comment

jakelishman Jul 21, 2023

Choose a reason for hiding this comment

jlapeyre Jul 21, 2023 • edited

Choose a reason for hiding this comment

jakelishman Jul 21, 2023

Choose a reason for hiding this comment

levbishop commented Jul 27, 2023

levbishop commented Jun 21, 2023 •

edited

jlapeyre commented Jul 21, 2023 •

edited

jlapeyre Jul 21, 2023 •

edited