Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add built-in rules for Unicode ID_Start and ID_Continue #180

Open
pdubroy opened this issue Dec 13, 2016 · 10 comments
Open

Add built-in rules for Unicode ID_Start and ID_Continue #180

pdubroy opened this issue Dec 13, 2016 · 10 comments
Labels
ecmascript Issues relating to parsing ES2015+ in examples/ecmascript/ enhancement

Comments

@pdubroy
Copy link
Contributor

pdubroy commented Dec 13, 2016

It would be nice to have some built-in rules for identifiers, so that we have a sensible default for people building new languages in Ohm. For reference, I took a look at a few different languages to see if any of them use the Unicode ID_Start/ID_Continue. Here are the results:

  • ES6 (spec): Yes
  • Rust (spec): Yes
    • Uses XID_start and XID_continue
  • Python3 (spec): Yes
    • Uses XID_start and XID_continue
  • F# (spec): No
    • identStart: Ltmou | Nl | "_"
    • identPart: identStart | "0..9" | Pc | Mn | Mc | Cf
  • Swift (spec): No
    • Just specifies a bunch of Unicode ranges, not sure what they correspond to
  • Go (spec): No
    • identStart: Ltmou
    • identPart: identStart | Nd
  • Scala (spec): No
    • identStart: Ltmou | Nl | "$" | "_"
    • identPart: identStart | "0..9" | "_"

Not sure this makes things any clearer, but I thought it would be good to have a few data points.

@pdubroy
Copy link
Contributor Author

pdubroy commented Dec 13, 2016

One more note: it seems that ID_Start is a subset of (Ltmou | Nl), and doesn't include _ or $. So I'm a bit confused that the Rust and Python specs say that identifiers must begin with an XID_start character.

@mroeder
Copy link
Contributor

mroeder commented Dec 13, 2016

According to http://unicode.org/reports/tr31/#Specific_Character_Adjustments, _ and $ are optional characters for ID_Start. ES5/ES2015 adds those two characters to ID_Start but Python or Rust might not do so. Since Ohm does not allow to exclude characters when overriding, I would suggest we go with the "clean" ID_Start and ID_Continue implementation and have e.g. ES6/ES2015 extend those with the extra characters.

@mroeder
Copy link
Contributor

mroeder commented Dec 13, 2016

One more thing I stumbled upon yesterday, is that ident is already heavily used in grammars (Ohm itself uses it, so do some of @alexwarth teaching grammars). It would be a bigger change (bootstrapping involved) to change this and on top of that I am not sure we would want rule names to span the full spectrum of (Unicode) identifier names (some of them not being valid in ES5).

@pdubroy
Copy link
Contributor Author

pdubroy commented Dec 14, 2016

Re: _ and $, I'm a bit confused by that Unicode document, but I think I've figured it out. From http://unicode.org/reports/tr31/#R1:

Another such profile would be to include some set of the optional characters, for example:

  • Start := XID_Start, plus some characters from Table 3

...where Table 3 contains exactly _ and $. So what this tells me is that _ and $ do NOT have the XID_Start property, however, this standard suggests that some languages may want to explicit add those as valid identStart characters. The unicode-9.0.0 package seems to confirm this interpretation:

> var xid_start = require('unicode-9.0.0/Binary_Property/XID_Start/regex');
undefined
> !!xid_start.exec('a')
true
> !!xid_start.exec('_')
false
> !!xid_start.exec('$')
false

What confuses me about Python and Rust is that they do both allow leading underscores in variable names...yet claim that identStart is just XID_Start. Oh well.

Anyways, now that I feel like I understand this...I agree with what you said, that our identStart rule should be "clean" and not include _ and $.

@pdubroy
Copy link
Contributor Author

pdubroy commented Dec 14, 2016

Re: ident: I don't think it's a problem to introduce ident as a built-in rule. It would be a breaking change, but anyone who upgrades would get an error at grammar instantiation time, and could just change their definition to use :=.

But now that I think about it, maybe it's better to not include a built-in ident rule, because of things like $ and _, and other similar exceptions (- and * in Lisp, ' in ML, ? in Ruby, etc.). However I still think unicodeIdStart and unicodeIdContinue would still be useful.

Thoughts?

@mroeder
Copy link
Contributor

mroeder commented Jan 3, 2017

Well, we could put in an ident rule and describe it in the documentation. It should use identStart and identPart which just references unicodeIdStart and unicodeIdContinue by default. Any language that needs extra characters, like _ or $, could extend the identStart rule to include those as well.

@pdubroy
Copy link
Contributor Author

pdubroy commented Jan 5, 2017

Yes, that would work. However, I'd lean towards not doing it that way because we can't really provide a sensible default, and basically every grammar would need to override identStart.

This would mean that you'd need to understand rule overriding to make almost any useful grammar, which is not the case right now. (Hmmm, though I suppose to support comments you do need to extend the space rule. But ident is a bit trickier since you need to understand that the hidden definition of ident begins with identStart.)

@alexwarth
Copy link
Contributor

I agree. My vote would be to not have a built-in ident rule (because as Pat said, we can't provide a sensible default) but keep unicodeIdStart and unicodeIdContinue.

@mroeder
Copy link
Contributor

mroeder commented Jan 5, 2017

Currently, there is no default. Everyone has to implement an identifier rule on his/her own.
With ident & co. there would be a default. A good one, that sometimes would need be extended, especially if you try to build a parser for an existing language and obsess over 100% correctness (in which case you should/need to know about overriding/extending).
If you want to build your own language, which hopefully is the majority of cases, why wouldn't you go with a standardized default (that is better than letter alnum* you probably would come up with on your own)?

@alexwarth
Copy link
Contributor

While I agree that a built-in ident rule would be convenient, I worry about the problems and confusion that may result from programmers not fully understanding what it does. E.g., suppose I want to use $ as an operator in a new language,

Exp = ident `$` ident

but "for some reason" I can't parse foo$bar as an Exp. WAT!?!?

IMO the small amount of additional work that goes into writing your own ident rule more than pays for itself because it avoids this kind of confusion.

@pdubroy pdubroy changed the title Add built-in rules for ident, identStart, identPart Add built-in rules for Unicode ID_Start and ID_Continue Jan 6, 2017
@elgertam elgertam added the ecmascript Issues relating to parsing ES2015+ in examples/ecmascript/ label May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecmascript Issues relating to parsing ES2015+ in examples/ecmascript/ enhancement
Projects
None yet
Development

No branches or pull requests

4 participants