More UTF-16 support; simple Regexp transpiler #2610

Draft. Wants to merge 1 commit into base: master

hmdne (Member) commented Nov 9, 2023

A little-known fact about JavaScript is that it uses UTF-16
encoding for its strings. But to take full advantage of UTF-16
support, one must use the correct functions; otherwise characters
outside the BMP, such as the now-ubiquitous emoji, are not
handled correctly.
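
To illustrate what "correct functions" means here, the following is a minimal sketch in plain JavaScript (not code from this patch) that contrasts code-unit based APIs with their code-point aware counterparts. The variable names are made up for the example.

```javascript
// "😀" is U+1F600, outside the BMP, so it is stored as two UTF-16 code units.
const emoji = "😀";

// Code-unit based APIs split the character in two.
const unitLength = emoji.length;               // 2, not 1
const firstUnit = emoji.charAt(0);             // lone high surrogate "\uD83D"
const truncated = String.fromCharCode(0x1f600); // truncated to 16 bits: "\uF600"

// Code-point aware APIs treat it as one character.
const codePoint = emoji.codePointAt(0);        // 0x1F600
const rebuilt = String.fromCodePoint(0x1f600); // "😀"

// The same split shows up in regexps: `.` matches one code unit
// without the `u` flag and one code point with it.
const dotLegacy = /^.$/.test(emoji);           // false
const dotUnicode = /^.$/u.test(emoji);         // true
```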

This commit also makes most regexps use Unicode mode. Because
Unicode mode regexps are stricter, we now really need a
half-decent transpiler, so this commit adds one. Taking advantage
of that, we also add support for POSIX character classes, which
are quite often used in Ruby but don't exist in JS, so we simulate
them with Unicode character classes.
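
As a rough illustration of the idea (not the transpiler from this patch), the sketch below rewrites POSIX bracket expressions into Unicode property escapes and shows one case where Unicode mode is stricter than legacy mode. The mapping table and the function name `transpilePosixClasses` are assumptions made for the example; the patch's actual mapping may differ.

```javascript
// Hypothetical mapping from POSIX class names to JS Unicode-mode escapes.
const POSIX_CLASSES = {
  alpha: "\\p{Alphabetic}",
  digit: "\\p{Nd}",
  upper: "\\p{Uppercase}",
  lower: "\\p{Lowercase}",
  space: "\\s",
};

// Replace each [:name:] in a Ruby regexp source with its JS equivalent.
// Unknown class names are left untouched.
function transpilePosixClasses(rubySource) {
  return rubySource.replace(/\[:(\w+):\]/g, (whole, name) =>
    Object.prototype.hasOwnProperty.call(POSIX_CLASSES, name)
      ? POSIX_CLASSES[name]
      : whole
  );
}

const jsSource = transpilePosixClasses("[[:alpha:]]+");
const re = new RegExp(jsSource, "u"); // \p{...} requires the `u` flag
const matched = "日本語 42".match(re)[0];

// Unicode mode rejects escapes that legacy mode tolerates, e.g. `\-`
// outside a character class, which is why some rewriting is needed.
let strictModeRejects = false;
try {
  new RegExp("\\-", "u");
} catch (e) {
  strictModeRejects = e instanceof SyntaxError;
}
```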

As a side effect, we now support value omission for hashes when
compiling with Opal in JS (e.g. when using eval). Since all the
MSpec tests do this, we pass the tests now.

We also add proper support for multiline regular expressions.
The semantics of multiline mode differ greatly between Ruby and
JS; they are basically two different features. This commit aims
to reconcile those two features in the most straightforward way.
It introduces reasonably proper handling of all of "\A", "\z",
"$" and "^". We assume that a regexp will contain only one set
of those, in which case things will work correctly. If not, we
emit a warning.
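
The difference can be seen from the JS side alone. The sketch below is plain JavaScript, not code from this patch; the comments describing Ruby behaviour reflect standard Ruby semantics, and how the patch chooses flags is not shown here.

```javascript
const text = "foo\nbar";

// In Ruby, ^ and $ are always line anchors, so /^bar$/ matches this text.
// In JS the same pattern needs the `m` flag.
const withoutM = /^bar$/u.test(text); // false
const withM = /^bar$/mu.test(text);   // true

// Ruby's /m flag makes `.` match "\n"; the JS counterpart is the `s` flag.
const dotWithoutS = /foo.bar/u.test(text); // false
const dotWithS = /foo.bar/su.test(text);   // true

// JS has no \A or \z. Without the `m` flag, ^ and $ anchor to the whole
// string, so they can stand in for \A and \z as long as the same regexp
// does not also need line anchors.
const wholeString = /^foo\nbar$/u.test(text); // true
```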

Regexps are now annotated if needed. This means that if a regexp
has been transpiled and the transpilation result differs from the
source, a copy of the original Regexp is preserved, so that
further manipulations on that Regexp, for instance Regexp.union,
work on the original Regexp.
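
One way such an annotation could look is sketched below. This is an assumption-laden illustration, not the patch's implementation: the symbol `ORIGINAL_SOURCE`, the stand-in `transpile` function and the `compile`/`union` helpers are all hypothetical names.

```javascript
// Hypothetical annotation key; the patch's actual mechanism may differ.
const ORIGINAL_SOURCE = Symbol("originalSource");

// Stand-in for a real transpiler: it only rewrites [[:digit:]].
function transpile(rubySource) {
  return rubySource.replace(/\[\[:digit:\]\]/g, "\\p{Nd}");
}

function compile(rubySource) {
  const jsSource = transpile(rubySource);
  const re = new RegExp(jsSource, "u");
  // Annotate only when transpilation actually changed the source.
  if (jsSource !== rubySource) re[ORIGINAL_SOURCE] = rubySource;
  return re;
}

// Build the union from the original sources, then transpile once,
// so nothing is transpiled twice.
function union(...regexps) {
  const sources = regexps.map((re) => re[ORIGINAL_SOURCE] ?? re.source);
  return compile(sources.map((s) => `(?:${s})`).join("|"));
}

const digits = compile("[[:digit:]]+");
const word = compile("[a-z]+");
const either = union(digits, word);
```

Working from the preserved source means the union sees `[[:digit:]]+` rather than the already rewritten `\p{Nd}+`.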

This PR has been sponsored by Ribose Inc.

@hmdne hmdne marked this pull request as draft November 9, 2023 10:18
@hmdne hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 097a343 to d9407a9 Compare November 9, 2023 11:58
hmdne added a commit to plurimath/plurimath-js that referenced this pull request Nov 12, 2023
- Full Oga compatibility, see:
  plurimath/plurimath#196
- Override in vendor/plurimath/lib/plurimath/math/symbol.rb
- Ensure htmlentities works correctly, see:
  #16
  - This also merges a patch for Opal, that includes more UTF-16 support:
    opal/opal#2610
@hmdne hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 9fb815c to 380ca94 Compare November 26, 2023 11:57
hmdne (Member, Author) commented Nov 26, 2023

The performance impact must be investigated.

hmdne added a commit to plurimath/plurimath-js that referenced this pull request Nov 28, 2023
This patch cherry-picks updated version of opal/opal#2610. This
update cleaned up Opal's regexp implementation quite a lot to fix
opal/opal#2616 which in turn caused that we got 2 more tests passed
on Plurimath and only the obvious failures remain on Parslet (that
is after re-enabling some previously disabled tests, most of which
pass now). This patch also removes the fix from #22, as it's not
needed anymore.
hmdne (Member, Author) commented Nov 29, 2023

The third iteration of this patch fixes a problem where a regexp like [^a] would be treated as containing a ^ assertion.
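
The distinction can be sketched with a small scanner that skips escapes and character classes. This is an illustrative sketch only, not the detection logic from this patch; the function name `hasCaretAssertion` is hypothetical and nested character classes are not handled.

```javascript
// Returns true if `source` contains a ^ that acts as an assertion, i.e.
// one that is neither escaped nor inside a [...] character class.
function hasCaretAssertion(source) {
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      i++; // skip the escaped character
    } else if (inClass) {
      if (ch === "]") inClass = false;
    } else if (ch === "[") {
      inClass = true;
    } else if (ch === "^") {
      return true;
    }
  }
  return false;
}

const negatedClass = hasCaretAssertion('[^"]'); // false: ^ negates the class
const lineAnchor = hasCaretAssertion("^foo");   // true
const escapedCaret = hasCaretAssertion("\\^");  // false: literal ^
```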

hmdne added a commit to plurimath/plurimath-js that referenced this pull request Nov 29, 2023
This pull request takes an updated opal/opal#2610 - there was an
issue in a previous version that treated `/[^"]/` as a multiline
regexp which caused Parslet to produce invalid results. The other
cases of this kind were fixed before.

In addition, this pull request cherry-picks opal/opal#2620, which
fixes a bug in `String#chars` for Strings containing characters
outside the BMP: characters with Unicode codepoints > 0xffff were
handled incorrectly. While those don't currently occur in tests
other than our unit test for an XML engine abstraction, having
support for those may be crucial in further development.
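
For context, the kind of mistake described above can be reproduced in plain JavaScript. This sketch does not show how opal/opal#2620 fixes it; it only contrasts splitting by code unit with splitting by code point.

```javascript
// U+1D466 (mathematical italic small y) lies outside the BMP.
const s = "x\u{1D466}z";

// Splitting on the empty string works per UTF-16 code unit and cuts the
// middle character into two lone surrogates.
const byCodeUnit = s.split("");

// String iteration works per code point, so Array.from keeps it whole.
const byCodePoint = Array.from(s);

console.log(byCodeUnit.length, byCodePoint.length);
```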

Successfully merging this pull request may close these issues.

Bug: Bad behaviour with combination of: StringScanner#scan, multiline regexp, multiple lines