Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185

stil · 2015-09-09T13:32:32Z

Currently, GD supports only UTF-8 characters of 1-3 bytes. Four byte characters are unsupported. HTML entities higher than  and 𘚟 are also unsupported.

It makes impossible to draw emojis, which are Unicode characters higher than U+1F601. UTF-8 representation of emojis are at least 4 bytes long. I thought I could bypass this issue using HTML entities. However it seems, that they are bugged too.

Invalid code lies in gdft.c file, gdTcl_UtfToUniChar() function.

Sources:

Possibly related issue

Someone at PHP bugtracker suspects, that due to integer overflow, Unicode characters higher than 65536 might also be broken.

The responses to this report seem to imply that ImageTtfText WIIL work correctly for characters in the range U+10000 - U+1869F (65536 - 99999). However, this does not seem to be the case. When I call it with 𐀁 (U+10001), it produces an "missing" glyph, even using a font that contains the character (CODE2001.ttf).

The text was updated successfully, but these errors were encountered:

vapier · 2015-09-09T16:09:31Z

can you attach a small file that illustrates the issue for us ?

stil · 2015-09-09T20:26:43Z

4 byte UTF-8 characters issue

Below you can see glyphs of Arial Unicode font.

And here you see, how GD draws consecutive glyphs starting from 0xFFD0 and ending at 0x1001F.

You can clearly see, that characters after 0xFFFF are incorrectly parsed.

Check UTF-8 byte representation of 0xFFFF and 0x10000 Unicode characters here:

vapier · 2015-09-09T20:54:01Z

i'm not saying you're wrong ... i'm asking for a small program that we can build/run to check the behavior on our side so that we can develop a fix & testcase. that way we don't spend time recreating something you/someone else has already written, and we don't end up making changes for a different codepath w/out actually fixing the ones you're reporting.

cmb69 · 2016-08-22T14:36:15Z

Adam Harvey already supplied a respective reproduce script including a respective font file and actual and expected images in the respective PHP bug report, and the first halve of the necessary bugfixes. I have now analyzed the issue there further. TL;DR: we have to also select the appropriate charmap (a font can have several of them).

pierrejoye · 2016-08-23T01:34:18Z

Specifically for emoji, we need to implement the ft flag support. As far as I remember we have another issue for that.

cmb69 · 2016-08-23T09:06:36Z

@pierrejoye Indeed there is issue #184, which would be a nice enhancement.

Wrt. this issue, it seems that in the long run we can simplify finding the desired charmap by just using FT_Select_Charmap() instead. For GD-2.2.x (and maybe GD-2.x.y) I would suggest to go with a minimal fix, namely to use FT_SelectCharmap() only if encoding is detected as gdFTEX_Unicode.

The biggest hurdle I'm currently facing is finding a font which we can use for the test case. Adam's test uses code2001.ttf which is apparently freeware, but it's lacking a license file.

vapier · 2016-08-23T18:49:43Z

you could always take an existing free font and hack it up with something like FontForge or Birdfont. we don't need the glyphs to be "correct", just unique so we can compare them against a reference image.

cmb69 · 2016-09-09T18:12:43Z

Thanks, Mike! I'll give FontForge a try.

0biWanKenobi · 2018-02-27T22:10:27Z

Hello, this seems to be still unfixed? Any ETA? Thanks

cmb69 · 2018-03-06T22:16:32Z

Any ETA?

No, sorry. A PR including a regression test would be very welcome!

0biWanKenobi · 2018-03-08T21:14:18Z

A PR including a regression test would be very welcome!

I'd love to help! Is there a way we can discuss what needs to be done, without polluting the thread? :)
A basic patch, as proposed here and mentioned above, would be to just extend the number of chars that the function is parsing.. Is that not enough? If there's more to be done, I can try helping, I just need some direction :)

pentium10 · 2019-10-29T14:48:32Z

any update of this in 2019?

sterlingpickens · 2021-04-17T13:26:42Z

I've modified libgd and got it printing the entire 4 byte range now.
I'm still testing and am unsure about the html5 entities.
Also, I noticed xshow don't seem to account for the horiBearingX, is this intentional ie: meant to return the glyph origin ?

sterlingpickens · 2021-04-17T14:29:47Z

Here is a patch to test.
Issue-185-01.diff

sterlingpickens · 2021-04-20T18:48:32Z

I refined the patch and extended it to include all html5 entities.
http://sterlingdesktops.com/pub/test/Issue-185-02.diff
I just need to do a complete test to verify that everything works.
Also, we need to possibly use a different charmap when there is no glyph present. This patch it hardcoding to let freetype decide with: FT_Select_Charmap(face, FT_ENCODING_UNICODE);

FT_Set_Charmap(face, charmap); stops working when higher codepages are used here.

I'll submit a pull request after further testing and would appreciate some comments.

sterlingpickens · 2021-04-20T23:38:14Z

http://sterlingdesktops.com/pub/test/Issue-185-05.diff
I verified that this one works with all the different html/utf input formats.

Edit latest:
http://sterlingdesktops.com/pub/test/Issue-185-06.diff

vapier · 2021-04-24T23:40:20Z

the HTML spec links to a JSON database:
https://html.spec.whatwg.org/entities.json

seems like that'd be a lot easier to parse than an ad-hoc HTML parser ? could even do it in Python pretty easily.

sterlingpickens · 2021-04-25T00:24:32Z

Who actually runs the generation script/program ?
I also worry that /usr/include/entities.h is not a good idea since libgd is the only thing using it(I could be wrong).
If this is the case, then we should compile a entities.h/.c only and not install it as a header.
The best solution might be to include a header that is already on most people's systems from a different project.
Also, that entities.json will more than double the array members because of case sensitivity and ";".

If we are going to be completely thorough then we should check that there are no conflicts with unicode as well. IE: & as the first byte matching some of these entities.

vapier · 2021-04-25T00:52:18Z

i agree we should stop installing entities.h. i'll do that for the next major release. this should only exist for our own gdft.c usage.

the generation script is for us devs. we're committing the header file to the tree so we can ship it in releases.

sterlingpickens · 2021-04-25T02:13:26Z

gd_Entity_To_Unicode should be moved to entities.c as well, but then that'd complicate the gen script we end up using.

vapier · 2021-04-25T04:51:10Z

wrt entities.json, we can easily dedupe based on the trailing ; being omitted. similarly, i think we can normalize the entity names to lowercase and call it a day.

sterlingpickens · 2021-04-25T05:14:37Z

"Entity names (the things that follow ampersands) are case-senstive, but many browsers will accept many of them entirely in uppercase or entirely in lowercase; a few must be cased in particular ways."

I think you're right, we have to go with entities.json.

4 different ways of specifying the same thing in this one example:
Dot die DoubleDot uml

My original solution was just taking the first, I hadn't considered this.
I'll give it some thought and try to put something together tomorrow.

Also there can be two unicode codepoints per entity ie: NotLessLess ≪̸

sterlingpickens · 2021-04-25T21:27:27Z

gdImageStringFTEx will have to be modified to handle two codepoints returned.
struct entities_s must have two ints to hold all these.

This will take some time. I will try not to flood this thread anymore untill I have something substantial.
I'm playing with json-c right now.

vapier · 2021-04-25T21:30:28Z

I would imagine python would be faster to do this in, so feel free to use that

sterlingpickens · 2021-04-25T21:38:46Z

I dunno about that, i'm already printing all the key/value pairs as strings in C.
It seems most the functions are just macros. The fastest might just be a bash script tbh.

I know perl better than I know python. perl regex would make quick work of it too.

vapier · 2021-04-25T21:51:45Z

we're in the process of removing perl from the codebase ;)

sterlingpickens · 2021-04-26T01:16:46Z

Ok, got a python generator working.
I need to modify gdft.c some more now

sterlingpickens · 2021-04-27T00:14:44Z

Everything is in place at this point. It just needs testing.

Things i've tested:
All single codepage unicode given as utf8 (one per function call same image).
All single entity string/hex/dec (one per function call same image).
A string containing all utf8 encoded unicode for various fonts (single function call).
A string containing all entity strings (single function call).

I still need to test mixing of these as the function supports that and should autodetect.
Also I need to test that the JISX0208 or any non-unicode has not been broken.

#695

pierrejoye · 2021-09-04T11:06:26Z

I think we are good with the PR. We need tests, but I can give @sterlingpickens a hand for this. @sterlingpickens did an amazing job here and libgd will be better with this addition.

@vapier @cmb69 merge and I can help then, thought?

cmb69 · 2021-12-15T12:00:53Z

Yeah, I suggest to merge PR #695, and to go from there.

stil changed the title Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 Sep 9, 2015

vapier added the bug label Sep 9, 2015

cmb69 self-assigned this Aug 22, 2016

GaryAllan mentioned this issue Apr 23, 2018

Racks and device display unrecognizable characters in chinese phpipam/phpipam#1904

Closed

sterlingpickens linked a pull request Apr 23, 2021 that will close this issue

Issue-185 changes #695

Open

cmb69 removed their assignment Dec 15, 2021

damianwadley mentioned this issue Aug 24, 2023

imagettftext() How to display Unicode Plane 1 characters php/doc-en#2705

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185

Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185

stil commented Sep 9, 2015

vapier commented Sep 9, 2015

stil commented Sep 9, 2015

vapier commented Sep 9, 2015

cmb69 commented Aug 22, 2016

pierrejoye commented Aug 23, 2016

cmb69 commented Aug 23, 2016

vapier commented Aug 23, 2016

cmb69 commented Sep 9, 2016

0biWanKenobi commented Feb 27, 2018

cmb69 commented Mar 6, 2018

0biWanKenobi commented Mar 8, 2018 •

edited

pentium10 commented Oct 29, 2019

sterlingpickens commented Apr 17, 2021

sterlingpickens commented Apr 17, 2021

sterlingpickens commented Apr 20, 2021 •

edited

sterlingpickens commented Apr 20, 2021 •

edited

vapier commented Apr 24, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021 •

edited

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 26, 2021

sterlingpickens commented Apr 27, 2021 •

edited

pierrejoye commented Sep 4, 2021

cmb69 commented Dec 15, 2021

Add support for 4 byte UTF-8 characters and HTML entities over &#xFFFF; and &#99999; #185

Add support for 4 byte UTF-8 characters and HTML entities over &#xFFFF; and &#99999; #185

Comments

stil commented Sep 9, 2015

Possibly related issue

vapier commented Sep 9, 2015

stil commented Sep 9, 2015

4 byte UTF-8 characters issue

vapier commented Sep 9, 2015

cmb69 commented Aug 22, 2016

pierrejoye commented Aug 23, 2016

cmb69 commented Aug 23, 2016

vapier commented Aug 23, 2016

cmb69 commented Sep 9, 2016

0biWanKenobi commented Feb 27, 2018

cmb69 commented Mar 6, 2018

0biWanKenobi commented Mar 8, 2018 • edited

pentium10 commented Oct 29, 2019

sterlingpickens commented Apr 17, 2021

sterlingpickens commented Apr 17, 2021

sterlingpickens commented Apr 20, 2021 • edited

sterlingpickens commented Apr 20, 2021 • edited

vapier commented Apr 24, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021 • edited

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 25, 2021

vapier commented Apr 25, 2021

sterlingpickens commented Apr 26, 2021

sterlingpickens commented Apr 27, 2021 • edited

pierrejoye commented Sep 4, 2021

cmb69 commented Dec 15, 2021

Add support for 4 byte UTF-8 characters and HTML entities over and 𘚟 #185

Add support for 4 byte UTF-8 characters and HTML entities over and 𘚟 #185

0biWanKenobi commented Mar 8, 2018 •

edited

sterlingpickens commented Apr 20, 2021 •

edited

sterlingpickens commented Apr 20, 2021 •

edited

sterlingpickens commented Apr 25, 2021 •

edited

sterlingpickens commented Apr 27, 2021 •

edited