Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Draft) Better Unicode Support #477

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

levbishop
Copy link
Contributor

@levbishop levbishop commented Jun 21, 2023

Summary

Fixes #422
Fixes #440
Replaces #428

This tries to bring the spec and grammar more into compliance with the latest proposed UAX31 and draft UTS55

Details and comments

ANTLR doesn't have great support for set intersection/difference operations on unicode character sets so I had to duplicate some fragments pulling the definitions from PropList.txt but these shouldn't ever be changing.

I'm still thinking about how best to handle the rules for ignorable white space.

@jlapeyre
Copy link
Contributor

jlapeyre commented Jul 21, 2023

@levbishop , I see above "replaces #422". Shouldn't this instead be "replaces #428"? If so, I'll close #428.

Copy link
Contributor

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I haven't been in meetings any time recently, so I might not have all the context, but since I was asked by Lev, here's a couple of comments on the text.

In terms of code of the ANTLR-based parser, if I were reviewing something for a product that I owned, I'd complain about a lack of testing here, and that the normalisation hasn't actually been implemented. I don't necessarily think that that should actually stand in the way of the spec change, though, given how complex it is.

Comment on lines +516 to +517
| :math:`\pi` | pi | - µ U+00B5 MICRO SIGN | 3.1415926535... |
| | | - μ U+03BC GREEK SMALL LETTER MU | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some pretty unusual representations of pi.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops i'll fix

: '"' ~["\r\t\n]+? '"'
| '\'' ~['\r\t\n]+? '\''
: '"' NotUAX31NewlineDoubleQuote+? '"'
| '\'' NotUAX31NewlineSingleQuote+? '\''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, \r and \t were also forbidden in string literals, which this PR now allows. I think that's totally fine, especially since the spec at no point actually even defines a string literal (as opposed to a "bitstring literal", except for a somewhat implicit definition in the "included files" bit), I just wanted to mention it.

Copy link
Contributor

@jlapeyre jlapeyre Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String literals supporting a string data type will be essentially impossible to add later if desired (I think likely) anyway. Because of bitstring literals. EDIT: Well, maybe some kind of context or constructor would do.

Comment on lines +14 to +37
A precise statement of the Unicode compatibility is:

- `UAX31-C1 <https://www.unicode.org/reports/tr31/tr31-37.html#C1>`_: The OpenQASM language conforms to version 37 of the Unicode® Standard Annex #⁠31
- `UAX31-C2 <https://www.unicode.org/reports/tr31/tr31-37.html#C2>`_: It observes the following requirements:
- `UAX31-R1-2 <https://www.unicode.org/reports/tr31/tr31-37.html#R1-2>`_: Default Identifiers: To determine whether a string is an identifier it uses `UAX31-D1 <https://www.unicode.org/reports/tr31/tr31-37.html#D1>`_ with the following profile:
- ``Start := [[:XID_Start:]_]``
- ``Continue := [:XID_Continue:]``
- ``Medial := []``
- `UAX31-R1b <https://www.unicode.org/reports/tr31/tr31-37.html#R1b>`_ Stable Identifiers: Once a string qualifies as an identifier, it does so in all future versions.
- `UAX31-R4 <https://www.unicode.org/reports/tr31/tr31-37.html#R4>`_. Equivalent Normalized Identifiers using normalization form C (NFC).

Additionally, to avoid line-break spoofing, we comply with the proposed

- `UAX31-R3a-1`. Use ``Pattern_White_Space`` characters as all and only those the set of characters interpreted as whitespace in parsing., as follows:
- A sequence of one or more of any of the following characters shall be interpreted as a sequence of one or more end of line:
- ``U+000A`` (line feed)
- ``U+000B`` (vertical tabulation)
- ``U+000C`` (form feed)
- ``U+000D`` (carriage return)
- ``U+0085`` (next line)
- ``U+2028`` LINE SEPARATOR
- ``U+2029`` PARAGRAPH SEPARATOR
- The ``Pattern_White_Space`` characters with the property ``Default_Ignorable_Code_Point`` shall be treated as ignorable format controls
- All other characters in ``Pattern_White_Space`` shall be interpreted as horizontal space.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally seems sensible from an abstract perspective. Is it worth mentioning / calling out that we don't actually expect any near-term implementers to have fully implemented all the Unicode normalisation?

Also, this text technically permits the sequence U+OOOA U+OOOA to represented 1, 2 or really any natural number of "ends of line". I don't know how much you care about trying to define something unambiguous that's sensible here, though, since even the old \n / \r\n / \r split isn't the easiest thing in the world to define.

@levbishop
Copy link
Contributor Author

@levbishop , I see above "replaces #422". Shouldn't this instead be "replaces #428"? If so, I'll close #428.

Yes. I meant "closes #422 and replaces #428"

@jlapeyre jlapeyre added Reference Parser ANLTR-based Python reference parser clarify specification Change text of spec, not (mainly) semantics of spec labels Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarify specification Change text of spec, not (mainly) semantics of spec Reference Parser ANLTR-based Python reference parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify unicode support Ambiguity with character code for mu in microsecond
3 participants