More UTF-16 support; simple Regexp transpiler #2610

hmdne · 2023-11-09T08:59:58Z

A little-known fact about JavaScript is that it uses UTF-16
encoding for its strings. But to take full advantage of UTF-16
support, one must use the correct functions; otherwise characters
outside the BMP, such as the now-ubiquitous emoji, are not
supported.
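
To illustrate the difference between code-unit and code-point functions, here is a minimal sketch in plain JavaScript. It is not code from this patch, and the variable names are only for illustration.

```javascript
// U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const s = "a\u{1F600}";

// Code-unit based APIs see three units and can split the pair.
const unitLength = s.length;          // 3
const unitCode = s.charCodeAt(1);     // 0xD83D, the high surrogate only
const brokenHalf = s.charAt(1);       // a lone surrogate, not a character

// Code-point aware APIs treat the emoji as a single character.
const codePoint = s.codePointAt(1);   // 0x1F600
const chars = Array.from(s);          // two elements: "a" and the emoji
const roundTrip = String.fromCodePoint(codePoint);
```

Anything built on `length`, `charAt` or `charCodeAt` needs this kind of care once strings may contain such characters.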

This commit also makes most regexps use Unicode mode. Because
Unicode mode regexps are stricter, we now really need a half-decent
transpiler, so this commit adds one. Taking that opportunity, we
also add support for POSIX character classes, which are quite often
used in Ruby but don't exist in JS, so we simulate them with
Unicode character classes.
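
As a rough sketch of both points, assuming a hypothetical mapping table (the classes and property escapes chosen in the actual patch may differ), the idea looks something like this in plain JavaScript. `POSIX_TO_UNICODE`, `translatePosixClasses` and `compiles` are illustrative names, not Opal APIs.

```javascript
// Unicode mode rejects escapes that legacy mode silently accepts.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}
const legacyOk = compiles("\\-", "");   // true: \- is tolerated
const unicodeOk = compiles("\\-", "u"); // false: invalid escape in u mode

// Hypothetical mapping from POSIX classes to Unicode property escapes.
const POSIX_TO_UNICODE = {
  alpha: "\\p{Alphabetic}",
  digit: "\\p{Nd}",
  upper: "\\p{Uppercase}",
  lower: "\\p{Lowercase}",
  space: "\\p{White_Space}",
};

// Naive translation: assumes [:name:] only appears inside a bracket
// expression and is never escaped.
function translatePosixClasses(source) {
  return source.replace(/\[:(\w+):\]/g, (whole, name) => {
    const replacement = POSIX_TO_UNICODE[name];
    if (replacement === undefined) {
      throw new Error(`unsupported POSIX class: ${name}`);
    }
    return replacement;
  });
}

const translated = translatePosixClasses("^[[:alpha:]][[:alpha:][:digit:]]*$");
const identifier = new RegExp(translated, "u");
```

The property escapes only work with the `u` flag, which is one reason the two changes go together.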

As a side effect, this made us support value omission for hashes
when compiling with Opal in JS (e.g. when using eval). Since all
the MSpec tests do this, we pass the tests now.

We also add proper support for multiline regular expressions.
The semantics of multiline in Ruby and in JS differ greatly; they
are basically two different features. This commit aims to
reconcile them in the most straightforward way and introduces
reasonably proper handling of "\A", "\z", "$" and "^". We assume
that a regexp will contain only one set of those, in which case
things will work correctly. If not, we emit a warning.
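
For reference, the mismatch can be seen from the JS side alone. The snippet below is only a demonstration of standard JavaScript flag behaviour, with the Ruby equivalents noted in comments; it does not show how the patch performs the translation.

```javascript
const text = "first\nsecond";

// Without the `m` flag, JS `^` and `$` anchor to the whole string,
// which is what Ruby spells \A and \z.
const wholeString = /^second$/.test(text);            // false

// With the `m` flag they anchor to each line, which is how Ruby's
// ^ and $ always behave.
const perLine = /^second$/m.test(text);               // true

// Ruby's /m option instead means "dot matches newline", which JS
// exposes as the separate `s` (dotAll) flag.
const dotStopsAtNewline = /first.second/.test(text);  // false
const dotCrossesNewline = /first.second/s.test(text); // true
```

Because a JS regexp has a single `m` flag for the whole pattern, a Ruby regexp that mixes line anchors with string anchors has no direct flag-only equivalent.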

Regexps are now annotated if needed. This means that if a regexp
has been transpiled and the transpilation result differs from the
original, a copy of the original Regexp is preserved, so that
further manipulations of that Regexp, for instance Regexp.union,
work on the original Regexp.
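
The annotation idea could be sketched roughly as follows. This is an assumption-laden illustration: `annotate`, `unionSources` and the `originalSource` property are hypothetical names, and the real patch may store and use the original differently.

```javascript
// Hypothetical helper: compile the translated pattern, but remember the
// source it came from whenever the translation changed something.
function annotate(originalSource, translatedSource, flags) {
  const regexp = new RegExp(translatedSource, flags);
  if (translatedSource !== originalSource) {
    regexp.originalSource = originalSource; // illustrative property name
  }
  return regexp;
}

// Hypothetical union: start from the original sources so the combined
// pattern can be translated again as a whole.
function unionSources(regexps) {
  return regexps
    .map((re) => "(?:" + (re.originalSource ?? re.source) + ")")
    .join("|");
}

const word = annotate("[[:alpha:]]+", "[\\p{Alphabetic}]+", "u");
const number = annotate("\\d+", "\\d+", "u");
const combined = unionSources([word, number]);
```

Here `combined` still contains `[[:alpha:]]`, so nothing from the first translation leaks into the pattern that is built next.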

This PR has been sponsored by Ribose Inc.

- Full Oga compatibility, see: plurimath/plurimath#196
- Override in vendor/plurimath/lib/plurimath/math/symbol.rb
- Ensure htmlentities works correctly, see: #16
- This also merges a patch for Opal that includes more UTF-16 support: opal/opal#2610

hmdne · 2023-11-26T12:06:34Z

The performance impact must be investigated.

This patch cherry-picks an updated version of opal/opal#2610. The update cleaned up Opal's regexp implementation quite a lot to fix opal/opal#2616, which in turn means that 2 more tests pass on Plurimath and only the obvious failures remain on Parslet (that is, after re-enabling some previously disabled tests, most of which pass now). This patch also removes the fix from #22, as it's no longer needed. This PR has been sponsored by Ribose Inc.

hmdne · 2023-11-29T14:00:57Z

The third iteration of this patch fixes a problem where a regexp like `[^a]` would be treated as containing a `^` assertion.

This pull request takes an updated opal/opal#2610: a previous version had an issue where `/[^"]/` was treated as a multiline regexp, which caused Parslet to produce invalid results. The other cases of this kind were fixed before. In addition, this pull request cherry-picks opal/opal#2620, which fixes a mistake in `String#chars` on a String containing over-the-BMP characters: characters with Unicode codepoints > 0xffff were handled incorrectly. While those don't currently occur in tests other than our unit test for an XML engine abstraction, supporting them may be crucial for further development.
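
To make the `[^"]` case concrete, here is a minimal sketch of an anchor scanner that skips bracket expressions and escapes. It is not the code from opal/opal#2610; `findAnchors` and `anchorStyle` are hypothetical names, and constructs such as `(?#...)` comments or interpolation are deliberately not handled.

```javascript
// Collect the unescaped anchors used by a Ruby-style pattern source,
// ignoring anything inside [...] bracket expressions.
function findAnchors(source) {
  const found = new Set();
  let depth = 0; // nesting level of bracket expressions
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") {
      const next = source[i + 1];
      if (depth === 0 && (next === "A" || next === "z")) {
        found.add("\\" + next);
      }
      i++; // skip the escaped character either way
    } else if (ch === "[") {
      depth++;
    } else if (ch === "]") {
      if (depth > 0) depth--;
    } else if (depth === 0 && (ch === "^" || ch === "$")) {
      found.add(ch);
    }
  }
  return [...found].sort();
}

// Classify a pattern by the set of anchors it uses; "mixed" is the
// case where a warning would be appropriate.
function anchorStyle(source) {
  const anchors = findAnchors(source);
  const line = anchors.some((a) => a === "^" || a === "$");
  const string = anchors.some((a) => a === "\\A" || a === "\\z");
  if (line && string) return "mixed";
  if (line) return "line";
  if (string) return "string";
  return "none";
}
```

With this scanner, `anchorStyle('[^"]')` is `"none"`, while `anchorStyle("^a[^b]$")` is `"line"`, so a negated character class on its own no longer marks a regexp as multiline.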

hmdne marked this pull request as draft November 9, 2023 10:18

hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 097a343 to d9407a9 Compare November 9, 2023 11:58

hmdne mentioned this pull request Nov 12, 2023

Combine work inside submodules plurimath/plurimath-js#18

Merged

hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 9fb815c to 380ca94 Compare November 26, 2023 11:57

hmdne mentioned this pull request Nov 26, 2023

Bug: Bad behaviour with combination of: StringScanner#scan, multiline regexp, multiple lines #2616

Open

hmdne mentioned this pull request Nov 28, 2023

Correct Opal regexp inner working to increase correctness. plurimath/plurimath-js#23

Merged

hmdne linked an issue Nov 28, 2023 that may be closed by this pull request

Bug: Bad behaviour with combination of: StringScanner#scan, multiline regexp, multiple lines #2616

Open

hmdne force-pushed the hmdne/utf16 branch from 380ca94 to b5f387e Compare November 29, 2023 13:59

hmdne mentioned this pull request Nov 29, 2023

Fix all the remaining test cases. This fixes #8. plurimath/plurimath-js#24

Merged
