Improved support for non-Latin language inclusion in report PDFs, and validate translations for new phrases and new languages #417

carlhiggs · 2024-04-24T13:02:12Z

When producing our 25 city reports, we experienced issues supporting some non-Latin scripts like Tamil:
global-healthy-liveable-cities/global_scorecards#7

However, the PDF templating software we use (fpdf2) has recently implemented changes that should provide better support for non-Latin languages:
https://py-pdf.github.io/fpdf2/Unicode.html#note-on-non-latin-languages

To take advantage of this though, we will need to implement and test changes to our software:

- text shaping requires installation of uharfbuzz (which is not built for linux-aarch64 on conda-forge, complicating installation on arm64)
- text shaping should be an optional parameter set per font (may need to experiment with where it should be used)
  - actually, I just implemented it as set to True for now, as I couldn't see any downsides to it so far (as per below)
- support for right-to-left scripts like Arabic and Persian will require template adjustment
- need to validate and correct auto-translations and formatting for all new languages, and new phrases for existing languages
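
As a starting point for the right-to-left item, here is a minimal sketch of how report code could detect whether a phrase contains right-to-left characters, using only the standard library's `unicodedata.bidirectional()`. The helper name `contains_rtl` is hypothetical and not part of the GHSCI software, and the template adjustments themselves would still need to be worked out separately.

```python
import unicodedata

# Bidirectional classes used by right-to-left scripts:
# "R" (e.g. Hebrew) and "AL" (Arabic letters, which also cover Persian).
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


if __name__ == "__main__":
    # Sample phrases in English, Arabic, Persian and Tamil.
    for sample in ("Healthy cities", "مدن صحية", "شهرهای سالم", "தமிழ்"):
        print(repr(sample), contains_rtl(sample))
```

A check like this could be run over a language's translated phrases to decide whether a right-to-left template variant is needed.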

…an, Hindi, Tamil script etc) via uharfbuzz cpython module; had to install gcc and g++ for this to work on the arm64 build; also updated Jupyter Lab start alias as per #399. Also added new languages and auto-translations of these in anticipation of translation validation, in support of #367. This image hasn't been fully tested yet, and more language features require implementation (e.g. right to left template support for Arabic and Persian; needs to be added as new issue)

carlhiggs · 2024-04-24T15:04:43Z

After apparently successfully installing uharfbuzz, I timed report generation both without and with text shaping enabled. The image resources had already been generated, so those steps were skipped; I was only interested in the impact on PDF generation for three languages where text shaping wouldn't necessarily be expected to have a major effect (English, Spanish, and Chinese - Simplified).

For the change, I added pdf.set_text_shaping(True) in the _pdf_initialise_document() function in _utils.py, after preparing the PDF fonts. This is what it looked like before that change:

global-indicators/process/subprocesses/_utils.py, lines 1128 to 1137 in f8789eb:

    def _pdf_initialise_document(phrases, config):
        """Initialise PDF document."""
        pdf = FPDF(orientation='portrait', format='A4', unit='mm')
        prepare_pdf_fonts(
            pdf, config['reporting']['configuration'], config['pdf']['language'],
        )
        pdf.set_author(phrases['metadata_author'])
        pdf.set_title(f"{phrases['metadata_title1']} {phrases['metadata_title2']}")
        pdf.set_auto_page_break(False)
        return pdf
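
The change itself was the single call pdf.set_text_shaping(True). Since uharfbuzz may not be installable everywhere (e.g. linux-aarch64 via conda-forge), one possible variation is to guard that call. This is only a sketch: the helper name `enable_text_shaping` and its `requested` parameter are hypothetical, not what was committed, and `pdf` is assumed to be an fpdf2 `FPDF` instance.

```python
import importlib.util


def enable_text_shaping(pdf, requested=True):
    """Enable fpdf2 text shaping on `pdf` only if uharfbuzz is importable.

    `pdf` is assumed to be an fpdf2 FPDF instance (any object exposing
    set_text_shaping() works). Returns True if shaping was switched on.
    """
    # fpdf2's text shaping depends on the optional uharfbuzz package.
    has_uharfbuzz = importlib.util.find_spec("uharfbuzz") is not None
    enabled = bool(requested and has_uharfbuzz)
    pdf.set_text_shaping(enabled)
    return enabled
```

In `_pdf_initialise_document()` this would be called right after `prepare_pdf_fonts(...)`, in place of the unconditional call.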

From the GHSCI console I ran time generate example_es_las_palmas_2023, with the following results:

Before:

real    1m3.214s
user    0m57.162s
sys     0m0.759s

After:

real    1m7.041s
user    1m0.987s
sys     0m0.750s

So, it took about 4 seconds longer (3.8 s of wall-clock time, roughly a 6% increase) to produce 9 reports in 3 languages with text shaping than without. That's a pretty negligible difference in the scheme of things.
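
For anyone repeating the comparison, the arithmetic can be reproduced directly from the `time` output. This is just a small sketch; `parse_time` is a hypothetical helper written for this comment, and the values are the "real" times reported above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '1m3.214s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


before = parse_time("1m3.214s")  # wall-clock time without text shaping
after = parse_time("1m7.041s")   # wall-clock time with text shaping
n_reports = 9                    # 9 reports across 3 languages

delta = after - before
print(f"extra time: {delta:.3f} s")
print(f"relative increase: {delta / before:.1%}")
print(f"extra time per report: {delta / n_reports:.3f} s")
```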

Was there an aesthetic difference (bearing in mind that I didn't really expect to see one with these fonts; I just wanted to confirm there were no adverse impacts)? I couldn't really notice any. On the left here is before, on the right is after (for what it's worth, no obvious at-a-glance change in Chinese or Latin script):

So, what if we tried this with Hindi text?

I tried a few fonts, and ended up with one recommended by fpdf2 itself that apparently mostly works.
This is before the text shaping implementation:

and after:

There are missing glyphs in the matplotlib plots, so something is not working there, and it remains unknown whether this is actually working.

The text-shaping version is different, but is it more correct? Fingers crossed!

… reports in general with additional template refinements following feedback; also implemented a 'download_file()' function, currently only applied for fonts, it could be extended to be used for other dataset types, as per #418

carlhiggs added the enhancement New feature or request label Apr 24, 2024

carlhiggs self-assigned this Apr 24, 2024

carlhiggs referenced this issue Apr 24, 2024

implemented text shaping in fpdf and further developed translations

129c564

carlhiggs mentioned this issue Apr 29, 2024

Prompt users to attempt download of configured data with URL when no file is present #418

Closed

carlhiggs added a commit that referenced this issue Apr 29, 2024

further fancied fallback filename formatting for languages with typic…

468b2b1

…ally verbose phrases that frequently fail, for #417

carlhiggs added a commit that referenced this issue Apr 30, 2024

updated report translations towards #417

104072f

carlhiggs mentioned this issue May 1, 2024

Allow generating a report in a new language, even if not configured for that city #420

Closed

carlhiggs mentioned this issue May 16, 2024

Enhancements #426

Merged

carlhiggs closed this as completed May 16, 2024
