New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for running BiDi algorithm on text with styling annotations. #10
Conversation
- Upgrade to Emscripten 1.38 (requires explicit WASM=0 to build asm.js) - Upgrade to ICU 62.1 - Adds "processStyledBidirectionalText" that takes an array of style indices in parallel to the input text and returns results annotated with the correct (reordered) indices - Add basic test for processStyledBidirectionalText that that splits contiguous input style sections into discontinguous output sections - Regenerate checked-in build outputs.
Oh also this is meant to be backwards compatible with gl-js in both directions. Old versions that load this should just ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall to me!
The thing is, I'm not sure an annotated-per-section interface is actually easier to work with, because the caller then has to keep track of iterating over sections too, and it's confusing that a contiguous input section can actually be split up into multiple discontiguous output sections
Yeah, I agree that the per-section approach doesn't sound particularly easier to work with. It would presumably take less space, though, right? Not sure if the difference would be enough to matter...
src/icu.js
Outdated
return Module._malloc(4); | ||
} | ||
|
||
function readInt32Ptr(ptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to call this something like readAndFreeInt32Ptr
or consumeInt32Ptr
?
src/icu.js
Outdated
return reversed; | ||
} | ||
|
||
function processStyledBidirectionalText(text, styleIndices, lineBreakPoints) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add brief comment describing what styleIndices
should be and what the return value looks like
src/icu.js
Outdated
|
||
const mergedParagraphLineBreakPoints = mergeParagraphLineBreakPoints(lineBreakPoints, paragraphCount); | ||
|
||
let startIndex = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe lineStartIndex
?
Yeah, I'm going to start with the assumption that it doesn't matter. We're not holding onto these arrays, so the cost would just show up... I dunno, maybe with the wasm version of mapbox-gl-rtl-text the overhead of transferring the array would be greater? But it seems reasonable to ignore unless it actually shows up as significant in profiling. |
Huh, ICU's download site appears to be down, which breaks the build. I'll try again later. |
processStyledBidirectionalText
that takes an array of style indices in parallel to the input text and returns results annotated with the correct (reordered) indicesprocessStyledBidirectionalText
that that splits contiguous input style sections into discontinguous output sectionsNote the test strings in arabic.test.js are mind-bending and may break layout in your text viewer (that includes GitHub's preview: those arrays that look broken aren't).
The previous approach (which we still use for unstyled text) was:
setLine
with chosen line break points to prepare individual lineswriteReordered
to correctly transform/reorder all logical runs within the line into visual runs.The new approach for styled lines is:
setLine
with chosen line break points to prepare individual linesgetVisualRun
to iterate over directional runs within a line in visual order. For each run:writeReverse
to reverse the section and apply transformations. This may change the length of the segment, but since the whole segment shares the same style, we can just apply that style to every character in whatever the reversed output is. Note thatwriteReverse
is responsible for keeping combining characters together in the right (aka still "logical") order. If you changed style mid-combining character, this algorithm could move the combining character to the wrong place, and you would deserve whatever happened to you.One open question in my mind is whether the style annotation interface makes sense (an array of annotations maintained in parallel with a flat string). Alternatively, we could represent the input in terms of sections (aka instead of representing a section "foo" with style 0 as
["foo", [0, 0, 0]]
, we could represent it as["foo",0]
). Internally, it was easier for me to work with a data structure that made it super-easy to go from "source string logical index" -> "style", but it'd be easy enough to do the conversion. The thing is, I'm not sure an annotated-per-section interface is actually easier to work with, because the caller then has to keep track of iterating over sections too, and it's confusing that a contiguous input section can actually be split up into multiple discontiguous output sections.../cc @anandthakker