Improved support for non-Latin language inclusion in report PDFs, and validate translations for new phrases and new languages #417

Closed
3 of 4 tasks
carlhiggs opened this issue Apr 24, 2024 · 1 comment
carlhiggs commented Apr 24, 2024

When producing our 25 city reports, we experienced issues supporting some non-Latin scripts like Tamil:
global-healthy-liveable-cities/global_scorecards#7

However, the PDF templating software we use (fpdf2) has recently implemented changes that should provide better support for non-Latin languages:
https://py-pdf.github.io/fpdf2/Unicode.html#note-on-non-latin-languages

To take advantage of this though, we will need to implement and test changes to our software:

  • text shaping requires installation of uharfbuzz (which is not built for linux-aarch64 on conda-forge, complicating installation on arm64)
  • text shaping should be an optional parameter set per font (may need to experiment with where it should be used)
    • actually, I have just set it to True for now, as I couldn't see any downsides to it so far (as per below)
  • support for right-to-left scripts like Arabic and Persian will require template adjustment
  • need to validate and correct auto-translations and formatting for all new languages, and new phrases for existing languages
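
One way the optional shaping setting and the right-to-left requirement could be expressed is sketched below. The language table, the helper names and the `available` parameter are hypothetical and not part of the GHSCI code. The only fpdf2 call used is `set_text_shaping()`, with its `use_shaping_engine`, `direction` and `script` arguments, and the `pdf` argument is assumed to be an fpdf2 `FPDF` instance. Note that this only covers shaping; right-to-left page layout would still need the template changes mentioned above.

```python
import importlib.util

# Hypothetical per-language options; the keys and values are illustrative only.
SHAPING_BY_LANGUAGE = {
    "Tamil": {"use_shaping_engine": True},
    "Hindi": {"use_shaping_engine": True},
    "Arabic": {"use_shaping_engine": True, "direction": "rtl", "script": "arab"},
    "Persian": {"use_shaping_engine": True, "direction": "rtl", "script": "arab"},
}
# Default mirrors the issue text: shaping is simply switched on for now.
DEFAULT_SHAPING = {"use_shaping_engine": True}


def uharfbuzz_available():
    """Return True if the uharfbuzz module can be found without importing it."""
    return importlib.util.find_spec("uharfbuzz") is not None


def apply_text_shaping(pdf, language, available=None):
    """Enable shaping for `language` if configured and uharfbuzz is present.

    Returns True when shaping was enabled, False when it was switched off.
    """
    if available is None:
        available = uharfbuzz_available()
    options = SHAPING_BY_LANGUAGE.get(language, DEFAULT_SHAPING)
    if not available or not options["use_shaping_engine"]:
        # Fall back to unshaped text rather than failing, e.g. on arm64
        # systems where uharfbuzz could not be installed.
        pdf.set_text_shaping(False)
        return False
    pdf.set_text_shaping(**options)
    return True
```

Keeping the options in one table would make it easy to experiment with where shaping should be used, without touching the document initialisation code.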
@carlhiggs carlhiggs added the enhancement New feature or request label Apr 24, 2024
@carlhiggs carlhiggs self-assigned this Apr 24, 2024
carlhiggs referenced this issue Apr 24, 2024
…an, Hindi, Tamil script etc) via uharfbuzz cpython module; had to install gcc and g++ for this to work on the arm64 build; also updated Jupyter Lab start alias as per #399.  Also added new languages and auto-translations of these in anticipation of translation validation, in support of #367.  This image hasn't been fully tested yet, and more language features require implementation (e.g. right to left template support for Arabic and Persian; needs to be added as new issue)
@carlhiggs
Member Author

After apparently installing uharfbuzz successfully, I timed report generation both without and with text shaping enabled. The image resources had already been generated, so those steps were skipped; I was only interested in the impact on PDF generation for three languages where shaping wouldn't necessarily be expected to have a major impact (English, Spanish, and Chinese - Simplified).

For the change, I added pdf.set_text_shaping(True) to the _pdf_initialise_document() function in _utils.py, after preparing the PDF fonts. This is what the function looked like before that change:

def _pdf_initialise_document(phrases, config):
    """Initialise PDF document."""
    pdf = FPDF(orientation='portrait', format='A4', unit='mm')
    prepare_pdf_fonts(
        pdf, config['reporting']['configuration'], config['pdf']['language'],
    )
    pdf.set_author(phrases['metadata_author'])
    pdf.set_title(f"{phrases['metadata_title1']} {phrases['metadata_title2']}")
    pdf.set_auto_page_break(False)
    return pdf
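
For comparison, a sketch of how the function might look with the change applied. This is a reconstruction from the description above, not the committed code: the function name is hypothetical, and `pdf_class` and `prepare_fonts` are passed in as parameters (standing in for `FPDF` and `prepare_pdf_fonts`) purely so the sketch is self-contained.

```python
def initialise_document_with_shaping(phrases, config, pdf_class, prepare_fonts):
    """Initialise PDF document, enabling text shaping after font setup."""
    pdf = pdf_class(orientation='portrait', format='A4', unit='mm')
    prepare_fonts(
        pdf, config['reporting']['configuration'], config['pdf']['language'],
    )
    # The added line: with fpdf2 this needs the uharfbuzz package installed.
    pdf.set_text_shaping(True)
    pdf.set_author(phrases['metadata_author'])
    pdf.set_title(f"{phrases['metadata_title1']} {phrases['metadata_title2']}")
    pdf.set_auto_page_break(False)
    return pdf

# Assumed usage in the project context, with the names from the excerpt above:
# pdf = initialise_document_with_shaping(phrases, config, FPDF, prepare_pdf_fonts)
```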

From the GHSCI console I ran time generate example_es_las_palmas_2023, with the following results:

Before:

real    1m3.214s
user    0m57.162s
sys     0m0.759s

After:

real    1m7.041s
user    1m0.987s
sys     0m0.750s

So, it took about 4 seconds longer (roughly 6%) to produce 9 reports in 3 languages with text shaping than without. That's a pretty negligible difference in the scheme of things.
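
The arithmetic behind that comparison can be checked with a short script. This is only an illustration using the two "real" timings reported above; the `parse_duration` helper is hypothetical and handles just the `XmY.YYYs` format that the shell's time command printed here.

```python
import re


def parse_duration(text):
    """Convert a shell `time` value such as '1m3.214s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"Unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


before = parse_duration("1m3.214s")  # wall-clock time without text shaping
after = parse_duration("1m7.041s")   # wall-clock time with text shaping
delta = after - before
percent = 100 * delta / before
print(f"{delta:.3f} s slower ({percent:.1f}% of the baseline)")
# prints "3.827 s slower (6.1% of the baseline)"
```

This is a single run of each configuration, so the difference is indicative rather than a careful benchmark.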

Was there an aesthetic difference? I didn't really expect to see one with these fonts; I just wanted to confirm there were no adverse impacts. I couldn't really notice any. Before is on the left here and after is on the right (fwiw, no obvious at-a-glance change in Chinese or Latin script):

image

So.... what if we tried this with Hindi text?

I tried a few fonts; the one that ended up apparently mostly working is a font recommended by fpdf2 itself.
This is before text shaping implementation:
image

and after:
image
There are missing glyphs in the matplotlib plots, so something is not working there, and it remains unknown whether the text shaping itself is actually working.

The text shaping version is different, but is it more correct? Fingers crossed!

carlhiggs added a commit that referenced this issue Apr 29, 2024
… reports in general with additional template refinements following feedback; also implemented a 'download_file()' function, currently only applied for fonts, it could be extended to be used for other dataset types, as per #418
carlhiggs added a commit that referenced this issue Apr 29, 2024
…ally verbose phrases that frequently fail, for #417
@carlhiggs carlhiggs mentioned this issue May 16, 2024