More UTF-16 support; simple Regexp transpiler #2610
Draft: hmdne wants to merge 1 commit into master from hmdne/utf16
hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 097a343 to d9407a9 (November 9, 2023 11:58)
hmdne added a commit to plurimath/plurimath-js that referenced this pull request (Nov 12, 2023):
- Full Oga compatibility, see: plurimath/plurimath#196
- Override in vendor/plurimath/lib/plurimath/math/symbol.rb
- Ensure htmlentities works correctly, see: #16
- This also merges a patch for Opal that includes more UTF-16 support: opal/opal#2610
hmdne force-pushed the hmdne/utf16 branch 2 times, most recently from 9fb815c to 380ca94 (November 26, 2023 11:57)
The performance impact must be investigated.
hmdne added a commit to plurimath/plurimath-js that referenced this pull request (Nov 28, 2023):
This patch cherry-picks an updated version of opal/opal#2610. The update cleaned up Opal's regexp implementation considerably to fix opal/opal#2616, which in turn made 2 more tests pass on Plurimath and left only the obvious failures on Parslet (that is, after re-enabling some previously disabled tests, most of which pass now). This patch also removes the fix from #22, as it is no longer needed.
hmdne added a commit to plurimath/plurimath-js that referenced this pull request (Nov 28, 2023):
This patch cherry-picks an updated version of opal/opal#2610. The update cleaned up Opal's regexp implementation considerably to fix opal/opal#2616, which in turn made 2 more tests pass on Plurimath and left only the obvious failures on Parslet (that is, after re-enabling some previously disabled tests, most of which pass now). This patch also removes the fix from #22, as it is no longer needed. This PR has been sponsored by Ribose Inc.
A little-known fact about JavaScript is that it uses UTF-16 encoding for its strings. To get the full extent of UTF-16 support, however, one must use the correct functions; otherwise characters beyond the BMP, such as the now-ubiquitous emoji, are not supported.
This commit also makes most regexps use Unicode mode. Because Unicode-mode regexps are stricter, we now really need a half-decent transpiler, so this commit adds one. Taking that opportunity, we also add support for POSIX character classes, which are quite often used in Ruby but don't exist in JS, so we simulate them with Unicode character classes.
As a side effect, this made us support value omission for hashes when compiling with Opal in JS (e.g. when using `eval`). Since all the MSpec tests do this, we pass the tests now.
We also add proper support for multiline regular expressions. The semantics of multiline differ so much between Ruby and JS that they are basically two different features. This commit aims to reconcile those two features in the most straightforward way. It introduces fairly proper handling of all of "\A", "\z", "$", "^". We assume that a regexp will contain only one set of those, in which case things will work correctly. If not, we emit a warning.
Regexps are now annotated if needed. This means that if a certain regexp has been transpiled and the transpilation result differs, a copy of the original Regexp is preserved, so that further manipulations on that Regexp, for instance `Regexp.union`, work on the original Regexp.
This PR has been sponsored by Ribose Inc.
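To make two of the regexp differences described above concrete, here is a minimal sketch in plain JavaScript. It is not the transpiler from this PR: the `POSIX_TO_UNICODE` table and the `translatePosixClasses` helper are hypothetical names invented for illustration, and the mapping Opal actually uses may differ. The second half only demonstrates the standard `m` and `s` flags, which cover what Ruby does by default for `^`/`$` and with its `/m` flag respectively.

```javascript
// Hypothetical mapping for illustration only; Opal's real table may differ.
const POSIX_TO_UNICODE = {
  alpha: "\\p{Alphabetic}",
  digit: "\\p{Nd}",
  upper: "\\p{Uppercase}",
  lower: "\\p{Lowercase}",
  space: "\\p{White_Space}",
};

// Replace [:name:] with a Unicode property escape. Simplification: the token
// is rewritten wherever it appears, not only inside a bracket expression,
// and unknown class names are left untouched.
function translatePosixClasses(rubySource) {
  return rubySource.replace(/\[:(\w+):\]/g, (whole, name) =>
    Object.prototype.hasOwnProperty.call(POSIX_TO_UNICODE, name)
      ? POSIX_TO_UNICODE[name]
      : whole
  );
}

// Ruby's [[:alpha:]]+ becomes [\p{Alphabetic}]+, which requires the "u" flag.
const alphaRun = new RegExp(translatePosixClasses("[[:alpha:]]+"), "u");
console.log("żółw 🐢".match(alphaRun)[0]); // "żółw"

// Ruby's ^ and $ always match at line boundaries; JS needs the "m" flag.
const text = "first\nsecond";
console.log(/^second$/.test(text));  // false
console.log(/^second$/m.test(text)); // true

// Ruby's /m flag makes "." match newlines; the JS counterpart is the "s" flag.
console.log(/first.second/.test(text));  // false
console.log(/first.second/s.test(text)); // true
```

Because Ruby's `/m` corresponds to the JS `s` flag rather than the JS `m` flag, a translation step cannot simply copy flags across. It has to look at which anchors the pattern uses.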
The third iteration of this patch fixes a problem where a regexp like |
hmdne added a commit to plurimath/plurimath-js that referenced this pull request (Nov 29, 2023):
This pull request takes an updated opal/opal#2610: a previous version treated `/[^"]/` as a multiline regexp, which caused Parslet to produce invalid results. The other cases of this kind had been fixed before. In addition, this pull request cherry-picks opal/opal#2620, which fixes a mistake in `String#chars` on a String containing over-the-BMP characters, where characters with Unicode codepoints > 0xffff were handled incorrectly. While those don't currently occur in tests other than our unit test for an XML engine abstraction, support for them may be crucial in further development.
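The over-the-BMP problem mentioned above can be reproduced in plain JavaScript. The sketch below is not the code from opal/opal#2620; `chars` is a hypothetical helper that only shows the difference between splitting a string by UTF-16 code units and iterating it by code points.

```javascript
// U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const s = "a😀b";

console.log(s.length);                      // 4 code units, not 3 characters
console.log(s.charCodeAt(1).toString(16));  // "d83d", the high surrogate only
console.log(s.codePointAt(1).toString(16)); // "1f600", the full code point

// Splitting by code unit tears the surrogate pair into two broken halves.
console.log(s.split("").length); // 4

// Hypothetical helper: the string iterator walks code points, so Array.from
// keeps each surrogate pair together.
function chars(str) {
  return Array.from(str);
}
console.log(chars(s)); // [ 'a', '😀', 'b' ]

// The same distinction applies to regexps: without the "u" flag, "." matches
// a single code unit, so it cannot match the whole emoji.
console.log(/^.$/.test("😀"));  // false
console.log(/^.$/u.test("😀")); // true
```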