Suggested Feature: Unicode Properties #648

lachrist · 2020-05-11T07:57:30Z

Issue type

Bug Report: no
Feature Request: yes
Question: no
Not an issue: no

Prerequisites

Can you reproduce the issue?: yes
Did you search the repository issues?: yes
Did you check the forums?: yes
Did you perform a web search (google, yahoo, etc)?: yes

Description

Hi, first of all, thanks a lot for the work here!
I'm wondering why there are no facilities for accepting characters based on Unicode properties. The examples I've seen are based on enumerations and character ranges, but I'd argue that using Unicode properties is cleaner.

Currently this is possible, but it requires wrapping a JS RegExp in a predicate, which seems convoluted and slow:

(Apologies if there is a cleaner way to use predicates; I'm fairly new to this module.)

Identifier =
  head:$(char:.&{ return /\p{ID_Start}/u.test(char) })
  tail:$((char:.&{ return /\p{ID_Continue}/u.test(char) })*)
  { return head + tail }
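
For readers unfamiliar with property escapes, the check those two predicates perform can be sketched in plain JavaScript, outside of any grammar. This is only an illustration: `matchIdentifier`, `ID_START` and `ID_CONTINUE` are hypothetical names, not part of PEG.js, and the sketch assumes an engine with ES2018 property escapes.

```javascript
// Hypothetical helper, not part of PEG.js.
// The regexes are created once and reused for every character.
const ID_START = /^\p{ID_Start}$/u;
const ID_CONTINUE = /^\p{ID_Continue}$/u;

// Returns the identifier found at the start of `input`, or null if none.
function matchIdentifier(input) {
  // Array.from iterates by code point, so characters outside the BMP
  // are not split into two surrogate halves.
  const chars = Array.from(input);
  if (chars.length === 0 || !ID_START.test(chars[0])) {
    return null;
  }
  let end = 1;
  while (end < chars.length && ID_CONTINUE.test(chars[end])) {
    end++;
  }
  return chars.slice(0, end).join("");
}
```

Inside a grammar, the same two regexes could presumably be declared once in the grammar's initializer block rather than rebuilt in each predicate, which may address part of the speed concern.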

Slightly enriching the pegjs syntax would lead to a much nicer result:

Identifier =
  head:p{ID_Start}
  tail:$(p{ID_Continue}*)
  { return head + tail }

Am I missing something?

Regards,
Laurent

Seb35 · 2020-06-01T06:56:26Z

I did not know about Unicode properties, so if others are wondering about this, see this Unicode annex and the ECMAScript syntax.

This is quite new in ECMAScript (9th Edition – ECMAScript 2018) and I guess it would be very complicated to polyfill it, at least inside PEG.js; so perhaps users should be warned that their parser is not compatible with older environments? (Perhaps Babel has or will have a polyfill, but it is outside of PEG.js.)
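
One way such a warning could work is a run-time feature test. The following is a minimal sketch under that assumption; `supportsUnicodePropertyEscapes` and `warnIfUnsupported` are hypothetical names and nothing like them exists in PEG.js today.

```javascript
// Hypothetical helpers, not part of PEG.js.
function supportsUnicodePropertyEscapes() {
  try {
    // Built from a string so that an older engine fails here at run time
    // instead of rejecting the whole file at parse time.
    new RegExp("\\p{ID_Start}", "u");
    return true;
  } catch (error) {
    return false;
  }
}

// Calls `log` with a message when property escapes are unavailable.
// Returns true if a warning was emitted.
function warnIfUnsupported(log) {
  if (supportsUnicodePropertyEscapes()) {
    return false;
  }
  log("This parser relies on Unicode property escapes (ES2018), " +
      "which this environment does not support.");
  return true;
}
```

A generated parser could call something like `warnIfUnsupported(console.warn)` once at load time.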

About the syntax, I would favour [\p{ID_Start}], to reuse an (almost-)existing syntax aligned with ECMAScript. Trying it with the current version of PEG.js, it does not work because the \ is removed in the RegExp; likewise, [\d] does not work (see next snippet). We could take advantage of this feature request to add these other classical classes.

// Note the doubled backslash: inside a JS string literal, "\d" would already collapse to "d".
console.log( pegjs.generate( "Number = [\\d]", { output: "source" } ) );
// See beginning of function peg$parse around line 140:
// variable peg$c0 or peg$r0 depending on PEG.js version
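
To make the suggested behaviour concrete, here is a rough sketch of what passing a class through to the RegExp unchanged could look like. It does not reflect how PEG.js generates code; `compileClass` and `matchClassAt` are hypothetical names, and the sketch assumes an engine that supports the `u` and `y` flags together with property escapes.

```javascript
// Hypothetical helpers, not PEG.js internals.
// Compiles the source text of a character class, e.g. "[\\p{ID_Start}]",
// keeping the backslash escapes intact.
function compileClass(classSource) {
  if (!classSource.startsWith("[") || !classSource.endsWith("]")) {
    throw new SyntaxError("Expected a character class: " + classSource);
  }
  // "u" enables \p{...}; "y" (sticky) makes the match start exactly at lastIndex.
  return new RegExp(classSource, "uy");
}

// Returns the character matched at `pos`, or null if the class does not match there.
function matchClassAt(regexp, input, pos) {
  regexp.lastIndex = pos;
  const match = regexp.exec(input);
  return match === null ? null : match[0];
}

const idStart = compileClass("[\\p{ID_Start}]");
const digit = compileClass("[\\d]");
```

With these definitions, `matchClassAt(idStart, "x1", 0)` and `matchClassAt(digit, "x1", 1)` each return the single character at that position, while `matchClassAt(idStart, "x1", 1)` returns null.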

brettz9 · 2020-06-28T15:17:53Z

Great idea. I think there are at least a few options re: polyfilling:

Add regexpu as a runtime dependency, e.g., with https://github.com/mathiasbynens/regexpu#regexputranspilecodecode-options
Require Node 10 as a minimum (Node 8 has reached end-of-life anyway). Node 10 should allow its use directly: https://node.green/#ES2018-features--RegExp-Unicode-Property-Escapes . Browser support is pretty good per https://caniuse.com/#feat=mdn-javascript_builtins_regexp_property_escapes but for older browsers one may wish to point users to the limitations and the need for a polyfill.
If there were a desire to allow source files, and progressively distribution files, to make increasing use of ES6+ features, the project could go the Babel route and add this plugin (by the same Unicode expert author as regexpu).

Seb35 mentioned this issue Jun 4, 2020

Full Unicode support, namely for codepoints outside the BMP #586

Open

brettz9 mentioned this issue Jun 29, 2020

Role with catharsis, jsdoc's type parser, and TypeScript's parser jsdoctypeparser/jsdoctypeparser#109

Open
