Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for running BiDi algorithm on text with styling annotations. #10

Merged
merged 2 commits into from Jul 27, 2018

Conversation

ChrisLoer
Copy link
Contributor

  • Upgrade to Emscripten 1.38 (requires explicit WASM=0 to build asm.js)
  • Upgrade to ICU 62.1
  • Adds processStyledBidirectionalText that takes an array of style indices in parallel to the input text and returns results annotated with the correct (reordered) indices
  • Add basic test for processStyledBidirectionalText that that splits contiguous input style sections into discontinguous output sections
  • Regenerate checked-in build outputs.

Note the test strings in arabic.test.js are mind-bending and may break layout in your text viewer (that includes GitHub's preview: those arrays that look broken aren't).

The previous approach (which we still use for unstyled text) was:

  • Run BiDi on the whole paragaph
  • Determine line breaks
  • Run ICU's setLine with chosen line break points to prepare individual lines
  • Run ICU's writeReordered to correctly transform/reorder all logical runs within the line into visual runs.

The new approach for styled lines is:

  • Run BiDi on the whole paragaph
  • Determine line breaks
  • Run ICU's setLine with chosen line break points to prepare individual lines
  • Use ICU's getVisualRun to iterate over directional runs within a line in visual order. For each run:
    • if it's LTR, just copy it into the output, no transformations are needed
    • if it's RTL, iterate over it logically backwards (visually forwards), breaking into segments by style. For each segment, use ICU's writeReverse to reverse the section and apply transformations. This may change the length of the segment, but since the whole segment shares the same style, we can just apply that style to every character in whatever the reversed output is. Note that writeReverse is responsible for keeping combining characters together in the right (aka still "logical") order. If you changed style mid-combining character, this algorithm could move the combining character to the wrong place, and you would deserve whatever happened to you.

One open question in my mind is whether the style annotation interface makes sense (an array of annotations maintained in parallel with a flat string). Alternatively, we could represent the input in terms of sections (aka instead of representing a section "foo" with style 0 as ["foo", [0, 0, 0]], we could represent it as ["foo",0]). Internally, it was easier for me to work with a data structure that made it super-easy to go from "source string logical index" -> "style", but it'd be easy enough to do the conversion. The thing is, I'm not sure an annotated-per-section interface is actually easier to work with, because the caller then has to keep track of iterating over sections too, and it's confusing that a contiguous input section can actually be split up into multiple discontiguous output sections...

/cc @anandthakker

 - Upgrade to Emscripten 1.38 (requires explicit WASM=0 to build asm.js)
 - Upgrade to ICU 62.1
 - Adds "processStyledBidirectionalText" that takes an array of style indices in parallel to the input text and returns results annotated with the correct (reordered) indices
 - Add basic test for processStyledBidirectionalText that that splits contiguous input style sections into discontinguous output sections
 - Regenerate checked-in build outputs.
@ChrisLoer
Copy link
Contributor Author

Oh also this is meant to be backwards compatible with gl-js in both directions. Old versions that load this should just ignore processStyledBidirectionalText. New versions of gl-js that load an old version of the rtl-text plugin will still be able to run BiDi for unstyled text, but won't run it for styled text (see shaping.js logic at mapbox/mapbox-gl-js#6994).

Copy link

@anandthakker anandthakker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall to me!

The thing is, I'm not sure an annotated-per-section interface is actually easier to work with, because the caller then has to keep track of iterating over sections too, and it's confusing that a contiguous input section can actually be split up into multiple discontiguous output sections

Yeah, I agree that the per-section approach doesn't sound particularly easier to work with. It would presumably take less space, though, right? Not sure if the difference would be enough to matter...

src/icu.js Outdated
return Module._malloc(4);
}

function readInt32Ptr(ptr) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to call this something like readAndFreeInt32Ptr or consumeInt32Ptr?

src/icu.js Outdated
return reversed;
}

function processStyledBidirectionalText(text, styleIndices, lineBreakPoints) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add brief comment describing what styleIndices should be and what the return value looks like

src/icu.js Outdated

const mergedParagraphLineBreakPoints = mergeParagraphLineBreakPoints(lineBreakPoints, paragraphCount);

let startIndex = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lineStartIndex?

@ChrisLoer
Copy link
Contributor Author

It would presumably take less space, though, right? Not sure if the difference would be enough to matter...

Yeah, I'm going to start with the assumption that it doesn't matter. We're not holding onto these arrays, so the cost would just show up... I dunno, maybe with the wasm version of mapbox-gl-rtl-text the overhead of transferring the array would be greater? But it seems reasonable to ignore unless it actually shows up as significant in profiling.

@ChrisLoer
Copy link
Contributor Author

Huh, ICU's download site appears to be down, which breaks the build. I'll try again later.

@ChrisLoer ChrisLoer merged commit 97ff02e into master Jul 27, 2018
@ChrisLoer ChrisLoer deleted the styled-text branch July 27, 2018 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants