Suggested Feature: Unicode Properties #648

Open
lachrist opened this issue May 11, 2020 · 2 comments
lachrist commented May 11, 2020

Issue type

  • Bug Report: no
  • Feature Request: yes
  • Question: no
  • Not an issue: no

Prerequisites

  • Can you reproduce the issue?: yes
  • Did you search the repository issues?: yes
  • Did you check the forums?: yes
  • Did you perform a web search (google, yahoo, etc)?: yes

Description

Hi, first of all, thanks a lot for the work here!
I'm wondering why there are no facilities to accept characters based on Unicode properties. The examples I've seen are based on enumerations and character ranges, but I would argue that using Unicode properties is cleaner.

Currently this is possible, but it requires wrapping JS RegExp tests in predicates, which seems convoluted and slow:

(Apologies if there is a cleaner way to use predicates; I'm fairly new to this module.)

Identifier =
  head:$(char:.&{ return /\p{ID_Start}/u.test(char) })
  tail:$((char:.&{ return /\p{ID_Continue}/u.test(char) })*)
  { return head + tail }
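
For comparison, here is a minimal sketch of the same check in plain JavaScript, outside of any grammar. It only illustrates what the two predicates above are testing; `matchIdentifier` is a hypothetical helper name, not something PEG.js provides, and it assumes an engine with ES2018 Unicode property escapes.

```javascript
// Hypothetical helper, not part of PEG.js.
// One ID_Start character followed by any number of ID_Continue characters,
// anchored at the start of the input. The "u" flag is required for \p{...}.
const IDENTIFIER = /^\p{ID_Start}\p{ID_Continue}*/u;

function matchIdentifier(input) {
  const match = IDENTIFIER.exec(input);
  // Return the matched identifier text, or null when the input does not start with one.
  return match === null ? null : match[0];
}

console.log(matchIdentifier("été2 = 1")); // "été2"
console.log(matchIdentifier("2fast"));    // null, because a digit is not ID_Start
```

The grammar version has to run one such test per character through a predicate, which is where the overhead mentioned above comes from.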

Slightly enriching the PEG.js syntax would lead to a much nicer result:

Identifier =
  head:p{ID_Start}
  tail:$(p{ID_Continue}*)
  { return head + tail }

Am I missing something?

Regards,
Laurent

Seb35 commented Jun 1, 2020

I did not know about Unicode properties, so if others are wondering about this, see this Unicode annex and the ECMAScript syntax.

This is quite new in ECMAScript (9th Edition – ECMAScript 2018), and I guess it would be very complicated to polyfill, at least inside PEG.js; so perhaps users should be warned that their parser is not compatible with older environments? (Perhaps Babel has or will have a polyfill, but that is outside of PEG.js.)

About the syntax, I would favour [\p{ID_Start}], which reuses an (almost-)existing syntax aligned with ECMAScript. Trying it with the current version of PEG.js, it does not work because the \ is removed in the generated RegExp; for the same reason [\d] does not behave as it does in a RegExp (see next snippet). We could take advantage of this feature request to add these other classic classes.

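const pegjs = require("pegjs");
// "\\d" is needed in the JS string so that PEG.js actually receives [\d]
console.log( pegjs.generate( "Number = [\\d]", { output: "source" } ) );
// See beginning of function peg$parse around line 140:
// variable peg$c0 or peg$r0 depending on PEG.js version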
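
One thing to watch when passing a grammar as a JavaScript string literal: the string literal itself drops the backslash from an unrecognised escape such as `\d`, before PEG.js ever sees the grammar. The sketch below only uses plain strings (the variable names are illustrative) to show the difference between a single and a doubled backslash.

```javascript
// A single backslash before "d" is an identity escape in a JS string literal,
// so the backslash disappears from the resulting string.
const singleBackslash = "Number = [\d]";
// A doubled backslash produces one literal backslash in the resulting string.
const doubledBackslash = "Number = [\\d]";

console.log(singleBackslash);  // Number = [d]
console.log(doubledBackslash); // Number = [\d]
console.log(singleBackslash === "Number = [d]"); // true
```

So a test of how PEG.js treats `\d` inside a class needs the doubled form (or a grammar read from a file) to be meaningful.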

brettz9 commented Jun 28, 2020

Great idea. I think there are at least a few options re: polyfilling:

  1. Add regexpu as a runtime dependency, e.g., with https://github.com/mathiasbynens/regexpu#regexputranspilecodecode-options
  2. Require Node 10 as a minimum (Node 8 is end-of-life anyway). Node 10 should allow its use directly: https://node.green/#ES2018-features--RegExp-Unicode-Property-Escapes . Browser support is pretty good per https://caniuse.com/#feat=mdn-javascript_builtins_regexp_property_escapes but for older browsers the documentation may need to point users to the limitations and the need for a polyfill.
  3. If there were a desire to allow the source, and progressively the distribution files, to make increasing use of ES6+ features, the project could go the Babel route and add this plugin (by the same Unicode expert author as regexpu).
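
Whichever option is chosen, the environment can be checked at runtime. The following is a minimal sketch of such a feature test in plain JavaScript; `supportsUnicodePropertyEscapes` is a hypothetical helper name and not part of PEG.js, regexpu, or Babel. It relies only on the fact that engines without ES2018 property escapes reject `\p{...}` under the `u` flag with a SyntaxError.

```javascript
// Hypothetical helper: reports whether this engine understands \p{...} in RegExp.
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that older engines can still parse this file
    // and fail only here, at runtime.
    new RegExp("\\p{ID_Start}", "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) {
      return false;
    }
    throw error; // anything else is unexpected
  }
}

console.log(supportsUnicodePropertyEscapes());
```

A generated parser could use a check like this to fail early with a clear message instead of throwing a SyntaxError when the parser is loaded.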
