Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTL Language Support #2107

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

simonberger
Copy link
Member

@simonberger simonberger commented Mar 7, 2020

This feature adds:

  • Text is aligned right when direction is set to RTL
  • Text of RTL languages is reordered while LTR languages should be untouched
  • Dir attribute is supported beside the direction style on the most important elements

This does not use the ar-php library which was unstable in my tests and provided often wrong results. Instead it includes https://github.com/tecnickcom/tc-lib-unicode which looks promising.

I have zero knowledge of all languages like Arabic, Hebrew, Persian this supports. If someone has some more insights and could find flaws of this implementation or on the use of this library instead of ar-php please let us know here.

Known issues:

  • If inline blocks are used to decorate text for example, this block isn't correctly reordered in the line. Trying to move even these blocks with its contents could be a very troublesome job. It is caused by how dompdf internally works. I'll still have a look and maybe someone has an idea.

- Text aligned right when direction is set to rtl
- Text of RTL languages is reordered while LTR languages should be untouched
Copy link
Member

@bsweeney bsweeney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice addition. Nicola has done a lot of good work in this area and is very active in maintaining his projects so I think using his libraries is a solid choice.

The basic implementation looks solid so I'm fine moving forward with what you have even if you don't work out some of the kinks.

src/Css/Style.php Outdated Show resolved Hide resolved
@bsweeney
Copy link
Member

bsweeney commented Mar 8, 2020

What were your thoughts regarding block rendering. I was thinking we render from the last child instead of the first child for containers where direction is "rtl". But it would require knowing which children fit on the line first rather than calculating as we go.

@simonberger
Copy link
Member Author

simonberger commented Mar 8, 2020

I did not really think about it in detail. The problem is even more complex though when trying to implement it perfectly.

This line is rather simple:

<p style="direction: rtl">
	12 <span style="color: red">34</span> 56
</p>

With the current implementation we get 21 34 65 due to a bug in the library I guess will be fixed soon.
But we want to get 56 34 12 anyway.

The approach to just reorder the blocks of one line (if we would be able to solve the problem you described) would produce a wrong result here:

<p style="direction: rtl">
	12<span style="color: red">34</span>56
</p>

We want an unchanged order of 123456. The same belongs not just to english parts in an RTL block but also to the RTL languages as far as I understand where (some) words or glyphs are not reordered.

My initial thoughts were to add the changes even earlier in the dompdf process and passing the content of the whole block into the Bidi calculation. and rearranging the inline blocks accordingly.
But as wrote I did not evaluate if this is feasible nor how much effort this would cost.

@simonberger
Copy link
Member Author

Starting with this very basic support (after the aligning fix) would be a good idea in my opinion. There are so many things to do or consider having a better implementation. Reverting order of table columns is just one more coming to my mind.
We could try to tackle it in a later version.

@simonberger
Copy link
Member Author

Addresses:
#1009
#2103

I do not completely understand the issue in #712 but that shouldn't be fixed with this right?

@bsweeney
Copy link
Member

I was looking at this and I have some ideas about 712. Let me collect my thoughts and get back to you.

@maziyaryosofi
Copy link

it isn't work on the persian language.
also have problem with word and align:right also not working.
in this method the text like سلام print as مالس

@simonberger
Copy link
Member Author

simonberger commented Mar 14, 2020

Then you most likely did not set a direction: rtl css style
The characters aren't joined though which is the problem mentioned in #712.

@simonberger
Copy link
Member Author

Activated shaping of the tc-lib-unicode library in 62e7b36 which should fix this and also the issues in #712
I deactivated this initially because of some results that seemed not correct to me and now somehow almost forgot about it. On this example it produces a correct result.
We need some more testing.

@ThomasCaud
Copy link

Hello all. First of all, thank you for your work and support! I just would like to know if you know when the dompdf version will be available with this feature? I can see the PR is approved now. :)
Thomas

@simonberger
Copy link
Member Author

We still had one issue which now is fixed. (#2134, #2135)
This will be merged when #2134 is finished.

Releasing is managed by @bsweeney. From my side I'll just try to merge all my other pending pull requests and won't start something else for the next release.

@ThomasCaud
Copy link

Thanks for these informations. :)

@bsweeney
Copy link
Member

bsweeney commented May 2, 2020

@simonberger I went ahead and resolved a conflict in the Style class so this PR should be up-to-date with the changes I made.

@simonberger
Copy link
Member Author

simonberger commented May 3, 2020

@bsweeney
Thank you for Merging. Results are similar to how our merge branch looked like which is good.
From my side we could merge this now.
Do you want to have another look at #712 and/or did you test my change activating shaping?

@bsweeney
Copy link
Member

bsweeney commented May 3, 2020

I'll take another quick look, but I think we're probably good to merge.

@bsweeney bsweeney added this to the 0.8.6 milestone May 8, 2020
Copy link
Member

@bsweeney bsweeney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I've finally had a chance to look at this and toy around with the rendering and thinking about this feature some more. Sorry for the late detail analysis, there's a lot to discuss here so strap yourself in. Or scroll to the bottom for the tl;dr

I think the overall idea works but we need to tweak to the specifics of the implementation. I found three issues that should be addressable without too much work. I haven't made any changes because I wanted to get your feedback.

First, with this implementation you have to disable font subsetting. Otherwise, the rendered PDF shows the unknown-glyph character for shaped text. The reason is that the character subset in the document is determined before layout (ref). This is problematic because text shaping basically replaces the typed character with the shaped version. So, what we have in the font subset is the typed characters when we really need the shaped characters.

The quickest fix would be to modify the font subsetting logic to parse RTL text through tc-lib-unicode prior to registering it in the font subset. This would result in a performance hit, though, because then we'll be parsing the text with tc-lib-unicode twice.

I believe the long term solution to the subsetting issue is to move the logic into CPDF. This should give us better performance and more reliable output by making subsetting part of the CPDF rendering ("out") process.

Second, I'm pretty sure that even if text is not styled RTL that it should still be rendered with shaping. We could go all out and perform subsetting, directionalization, and shaping of the text prior to rendering (i.e. in the location referenced above where we currently do subsetting). It's a pretty significant change, though, because the actual document would have to be manipulated to make this work. As such this probably requires a larger discussion and may be best left to a future release. I think short term I'm fine limiting shaping to RTL text.

Long term, and ignoring for now discussion of the process by which we determine to what text we apply shaping, I think we can also resolve this issue by moving the logic into the PDF library (same as subsetting). It seems like the appropriate separation of concerns that the HTML+CSS rendering engine determines the where of rendering and the PDF rendering engine determines the what. With caveats that we would have to figure out as we analyze the issue more.

Third, the current implementation works well for short amounts of text but appears to break when the text flows to a new line. We're performing the directionality parsing too early in the process and, as a result, the text is reversed prior to reflowing within the document. So words that should appear on the first line appear, instead, on the last line.

For example, the following HTML:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=UTF-8">

    <link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet">
    <style type="text/css">
        body{
            direction: rtl;
            font-family: Amiri, DejaVu Sans, monospace;
            font-size: 36pt;
        }
    </style>
</head>
<body>
  <p>
    این یک متن تست جهت بررسی موضوع
    12 34 56
    <br>
    <span>
      این یک متن تست جهت بررسی موضوع
      12 34 56
    </span>
  </p>
</body>
</html>

looks as follows in the browser vs PDF:

rtl-line-wrapping
(highlighted text shows how particular words are placed incorrectly).

I think to address this issue we just need to move tc-lib-unicode usage later in the process. Maybe make it part of the canvas method and update the text methods with an argument indicating whether or not the text should be directionalized and shaped before being placed in the PDF.

This is probably one more areas where moving the actual text parsing logic into CPDF makes sense.


tl;dr long term my thought is we move a lot of the actual unicode parsing into CPDF. That seems like a lot more work than the current implementation, so I think for the next release we can better accommodate this feature by doing the following:

a. add a first round of text shaping to the font subset routine when the text is styled RTL so that the subsetted font contains the correct characters
b. keep the current logic in the text reflower but pass in "L" for the forcertl argument so that we transform the text early enough for reflow but don't reverse line order
c. add logic to the canvas adapter text methods to specify the directionality of the text and have those methods perform the RTL transformation

$text = $this->_collapse_white_space($text);

if ($style->direction === 'rtl') {
$bidi = new Bidi($text, null, null, 'R', true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of passing in 'R' for forcertl we should pass in false. tc-lib-unicode appears to have some logic (ref) to determine if RTL is needed for the passed-in text. I did some basic testing using the following HTML and it seems to work as expected:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=UTF-8">

    <link href="https://fonts.googleapis.com/css2?family=Amiri&display=swap" rel="stylesheet">
    <style type="text/css">
        body{
            direction: rtl;
            font-family: Amiri, DejaVu Sans, monospace;
            font-size: 36pt;
        }
    </style>
</head>
<body>
<div class="control-body">
  <p>
    12 <span style="color: red">34</span> 56
  </p>
  <p>
    این یک متن تست جهت بررسی موضوع
  </p>
</div>
</body>
</html>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case a right-to-left direction is request I think it is safe to force it.
Which problems do you want to solve letting tc-lib-unciode decide what to do?

If we are doing the step to pass all text it is another story surely. But I don't know if browsers are having an auto detection. I doubt that as there is just the CSS value ltr or rtl.

EDIT:
In fact the browsers indeed are applying a RTL text transform and a missing direction:rtl style just is not doing a right aligning.

EDIT 2:
Even direction:ltr is not stopping the browsers from transforming the text. This style is moreover changing structure ordering (table columns, floating?, text aligning...) rather than the text itself?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this initial release focused on the text itself. We can focus a future release on layout modification for rtl content. Conceptually this seems straightforward, just position from the right side of the container instead of the left. I think probably what will make it difficult is locating all the places where we need to add logic for this.

@simonberger
Copy link
Member Author

simonberger commented May 8, 2020

@bsweeney Thank you for the detailed review. That's what I was hoping for.

(1) Font subsetting
Good point. First a question. Isn't it generally possible (needed effort aside) to create the embedded font object dynamically or at the end or does this have to be complete at the start?

To fix that in the current state we have some options beside the tc-unicode-lib double run:

  • Disable font subsetting if we have any RTL style (no preferred solution)
  • Include all chars that could be added by shaping (https://github.com/tecnickcom/tc-lib-unicode-data/blob/develop/src/Arabic.php). This could be a good compromise but from a quick look it does not seem to work that well as the font subsetting is looking at the text instead of caring at the first place which chars the font provides. But we can include the possible substitutions whenever we hit a "source-glyph".
  • Running tc-lib-unicode or a sub process of it as you mentioned

(2) Always apply shaping
Thought about this as well. But we maybe should do that if we are sure it works well. Also we might enable processing it just by option.
On the other hand I don't know when and why shaping should have impact on non RTL text. It manipulates just an assessable number of glyphs.

(3) Breaking multiline RTL text
Took me some time to get the problem. [offtopic]It is quiet annoying to work on this with debugging tools as they switch are applying RTL direction to debugging state while it is off in the editor[/offtopic].

While writing an answer here I did a change which should work better:
7680338

I don't know how hard I thought about the place I set the bidi transform. I remember the idea was to be able to still do a split in case the text exceeds the limit just after the transform (I doubt that's really needed) but because the transform changes the order of the words the order gets broken if a line break happens.

@simonberger
Copy link
Member Author

Found the code comments suggesting the move of the font subsetting from rendering to output time.
I evaluated the change and for my knowledge of the code almost the whole logic of the selectFontmethod needs to move because of usage of the data probably coming from subsetting. But that should feasible.

@bsweeney
Copy link
Member

Regarding subsetting: So as you can see the whole reason we do subsetting the way we do is because the subset is built at the time the font is selected. I think the change would actually be fairly straightforward, but that's based on preliminary analysis. From what I see we would need to

  1. Remove the font subsetting logic from Dompdf.cls.php and then within the CPDF class call registerText from the addText method.
  2. In CPDF move the actual subsetting logic from the selectFont method to the "out" action of the o_font method.

There may be additional work, but that's what I see right now. We should also look at fixing up font subsetting support in PDFLib as I'm not sure we're doing it at all right now (I think all we need to do is set the global subsetting option).

This should probably be done in a separate branch. And we should probably do this for the next release if the totality of the work is as I've outlined above.

@bsweeney
Copy link
Member

bsweeney commented May 17, 2020

With the most recent change to the code I'm seeing an odd character replacement with the sample I posted above using the Amiri font. Spaces are being replaced by exclamations when font subsetting is enabled. This isn't a global issue, either, because DejaVu Sans is rendered correctly.

I checked the develop branch and it's happening there as well. I must have missed this issue when I was reviewing the subset-focused pull request.

--EDIT--
Maybe it's something going on with my system because I get that issue even if I render with the master branch.

@simonberger
Copy link
Member Author

Works for me. Damaged font cache maybe?

The bidi transform may reverse whitespace from start to end or vice versa.
This is restoring the initial whitespace mainly working around a bug in the bidi transform reversing the WS even on latin text and numbers.
@sbc640964
Copy link

When will it be appropriate for Hebrew?

@simonberger
Copy link
Member Author

@sbc640964 It should work for Hebrew text.
If you test it please leave some feedback if it works well or if you have problems.
We have no language knowledge to really test it.

@sbc640964
Copy link

sbc640964 commented Jun 29, 2020

@simonberger
I try unsuccessfully.
I am adding the
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

and this the document

<style>
  @font-face {
       font-family: 'Assistant';
       font-style: normal;
       font-weight: normal;
       src: url("the/path/to/ttf/file") format('truetype');
   }
</style>
<h2 class="text-info pb-5 font-weight-normal"  style="font-family: 'Assistant', sans-serif;">זו הכותרת בעברית</h2>

and this is the result
image

@simonberger
Copy link
Member Author

simonberger commented Jun 29, 2020

@sbc640964
Something is wrong with your styles or your font file.

Here is a working sample with your text:

<html>
<head>
    <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Assistant&display=swap">
</head>
<body>
    <div style="font-family: 'Assistant', sans-serif;">זו הכותרת בעברית</div>
</body>
</html>

EDIT:
That Hebrew text is displayed correctly is nothing this feature changes. It works already in 0.8.5. This pull request applies right to left rendering and right aligning.

@bsweeney
Copy link
Member

@simonberger I think that text is supposed to be rendered RTL due to strong directionality of the characters. This goes back to our discussion of enabling the bidi parsing at all time. This is a tweak I was looking at making against your branch but haven't gotten very far with (and we need to discuss more anyway).

@sbc640964 this branch is still a work-in-progress and we still have much to work out. If you set the direction of the text to RTL you'll see it rendered in the order expected. The directionality logic is only enabled when RTL is specified.

<html>
<head>
    <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Assistant&display=swap">
</head>
<body>
    <div style="font-family: 'Assistant', sans-serif; direction: rtl;">זו הכותרת בעברית</div>
</body>
</html>

@bsweeney bsweeney modified the milestones: dompdf-next, 2.0.0 Aug 24, 2021
@bsweeney bsweeney linked an issue Oct 27, 2021 that may be closed by this pull request
18 tasks
@bsweeney bsweeney modified the milestones: 2.0.0, 2.1.0 Nov 8, 2021
@bsweeney bsweeney changed the base branch from develop to master November 8, 2021 17:09
@kory-northrop
Copy link

kory-northrop commented Nov 24, 2021

I have been testing out these commits on my staging site and am finding a similar issue about the RTL rendering improperly if the text string is too long to fit onto one line. This was mentioned by @bsweeney in their comment above (#2107 (review)).

Additionally, I'm finding that the unordered list decoration is not appearing before the list item text (which should be the right side of the page) instead it is staying to the left side of the page. Did anyone else have that issue?

@learn05
Copy link

learn05 commented Mar 31, 2022

Solution is here Click Here

@bsweeney bsweeney mentioned this pull request Dec 10, 2022
@bsweeney bsweeney mentioned this pull request Mar 17, 2024
bsweeney added a commit that referenced this pull request Mar 19, 2024
per #2107
bsweeney added a commit that referenced this pull request Mar 19, 2024
per #2107
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support complex text layout (bidi, shaping, ordering)
7 participants