-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Draft) Better Unicode Support #477
base: main
Are you sure you want to change the base?
Conversation
@levbishop , I see above "replaces #422". Shouldn't this instead be "replaces #428"? If so, I'll close #428. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I haven't been in meetings any time recently, so I might not have all the context, but since I was asked by Lev, here's a couple of comments on the text.
In terms of code of the ANTLR-based parser, if I were reviewing something for a product that I owned, I'd complain about a lack of testing here, and that the normalisation hasn't actually been implemented. I don't necessarily think that that should actually stand in the way of the spec change, though, given how complex it is.
| :math:`\pi` | pi | - µ U+00B5 MICRO SIGN | 3.1415926535... | | ||
| | | - μ U+03BC GREEK SMALL LETTER MU | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are some pretty unusual representations of pi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops i'll fix
: '"' ~["\r\t\n]+? '"' | ||
| '\'' ~['\r\t\n]+? '\'' | ||
: '"' NotUAX31NewlineDoubleQuote+? '"' | ||
| '\'' NotUAX31NewlineSingleQuote+? '\'' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, \r
and \t
were also forbidden in string literals, which this PR now allows. I think that's totally fine, especially since the spec at no point actually even defines a string literal (as opposed to a "bitstring literal", except for a somewhat implicit definition in the "included files" bit), I just wanted to mention it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String literals supporting a string data type will be essentially impossible to add later if desired (I think likely) anyway. Because of bitstring literals. EDIT: Well, maybe some kind of context or constructor would do.
A precise statement of the Unicode compatibility is: | ||
|
||
- `UAX31-C1 <https://www.unicode.org/reports/tr31/tr31-37.html#C1>`_: The OpenQASM language conforms to version 37 of the Unicode® Standard Annex #31 | ||
- `UAX31-C2 <https://www.unicode.org/reports/tr31/tr31-37.html#C2>`_: It observes the following requirements: | ||
- `UAX31-R1-2 <https://www.unicode.org/reports/tr31/tr31-37.html#R1-2>`_: Default Identifiers: To determine whether a string is an identifier it uses `UAX31-D1 <https://www.unicode.org/reports/tr31/tr31-37.html#D1>`_ with the following profile: | ||
- ``Start := [[:XID_Start:]_]`` | ||
- ``Continue := [:XID_Continue:]`` | ||
- ``Medial := []`` | ||
- `UAX31-R1b <https://www.unicode.org/reports/tr31/tr31-37.html#R1b>`_ Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions. | ||
- `UAX31-R4 <https://www.unicode.org/reports/tr31/tr31-37.html#R4>`_. Equivalent Normalized Identifiers using normalization form C (NFC). | ||
|
||
Additionally, to avoid line-break spoofing, we comply with the proposed | ||
|
||
- `UAX31-R3a-1`. Use ``Pattern_White_Space`` characters as all and only those the set of characters interpreted as whitespace in parsing., as follows: | ||
- A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line: | ||
- ``U+000A`` (line feed) | ||
- ``U+000B`` (vertical tabulation) | ||
- ``U+000C`` (form feed) | ||
- ``U+000D`` (carriage return) | ||
- ``U+0085`` (next line) | ||
- ``U+2028`` LINE SEPARATOR | ||
- ``U+2029`` PARAGRAPH SEPARATOR | ||
- The ``Pattern_White_Space`` characters with the property ``Default_Ignorable_Code_Point`` shall be treated as ignorable format controls | ||
- All other characters in ``Pattern_White_Space`` shall be interpreted as horizontal space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally seems sensible from an abstract perspective. Is it worth mentioning / calling out that we don't actually expect any near-term implementers to have fully implemented all the Unicode normalisation?
Also, this text technically permits the sequence U+OOOA U+OOOA
to represented 1, 2 or really any natural number of "ends of line". I don't know how much you care about trying to define something unambiguous that's sensible here, though, since even the old \n
/ \r\n
/ \r
split isn't the easiest thing in the world to define.
Summary
Fixes #422
Fixes #440
Replaces #428
This tries to bring the spec and grammar more into compliance with the latest proposed UAX31 and draft UTS55
Details and comments
ANTLR doesn't have great support for set intersection/difference operations on unicode character sets so I had to duplicate some fragments pulling the definitions from PropList.txt but these shouldn't ever be changing.
I'm still thinking about how best to handle the rules for ignorable white space.