Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185

Open
stil opened this issue Sep 9, 2015 · 30 comments · May be fixed by #695
Open

Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185

stil opened this issue Sep 9, 2015 · 30 comments · May be fixed by #695
Labels

Comments

@stil
Copy link

stil commented Sep 9, 2015

Currently, GD supports only UTF-8 characters of 1-3 bytes. Four byte characters are unsupported. HTML entities higher than  and 𘚟 are also unsupported.

It makes impossible to draw emojis, which are Unicode characters higher than U+1F601. UTF-8 representation of emojis are at least 4 bytes long. I thought I could bypass this issue using HTML entities. However it seems, that they are bugged too.

Invalid code lies in gdft.c file, gdTcl_UtfToUniChar() function.

Sources:

Possibly related issue

Someone at PHP bugtracker suspects, that due to integer overflow, Unicode characters higher than 65536 might also be broken.

The responses to this report seem to imply that ImageTtfText WIIL work correctly for characters in the range U+10000 - U+1869F (65536 - 99999). However, this does not seem to be the case. When I call it with 𐀁 (U+10001), it produces an "missing" glyph, even using a font that contains the character (CODE2001.ttf).

@stil stil changed the title Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 Sep 9, 2015
@vapier vapier added the bug label Sep 9, 2015
@vapier
Copy link
Member

vapier commented Sep 9, 2015

can you attach a small file that illustrates the issue for us ?

@stil
Copy link
Author

stil commented Sep 9, 2015

4 byte UTF-8 characters issue

Below you can see glyphs of Arial Unicode font.

arial_unicode

And here you see, how GD draws consecutive glyphs starting from 0xFFD0 and ending at 0x1001F.

drawing_glyphs

You can clearly see, that characters after 0xFFFF are incorrectly parsed.

Check UTF-8 byte representation of 0xFFFF and 0x10000 Unicode characters here:

@vapier
Copy link
Member

vapier commented Sep 9, 2015

i'm not saying you're wrong ... i'm asking for a small program that we can build/run to check the behavior on our side so that we can develop a fix & testcase. that way we don't spend time recreating something you/someone else has already written, and we don't end up making changes for a different codepath w/out actually fixing the ones you're reporting.

@cmb69
Copy link
Contributor

cmb69 commented Aug 22, 2016

Adam Harvey already supplied a respective reproduce script including a respective font file and actual and expected images in the respective PHP bug report, and the first halve of the necessary bugfixes. I have now analyzed the issue there further. TL;DR: we have to also select the appropriate charmap (a font can have several of them).

@cmb69 cmb69 self-assigned this Aug 22, 2016
@pierrejoye
Copy link
Contributor

Specifically for emoji, we need to implement the ft flag support. As far as I remember we have another issue for that.

@cmb69
Copy link
Contributor

cmb69 commented Aug 23, 2016

@pierrejoye Indeed there is issue #184, which would be a nice enhancement.

Wrt. this issue, it seems that in the long run we can simplify finding the desired charmap by just using FT_Select_Charmap() instead. For GD-2.2.x (and maybe GD-2.x.y) I would suggest to go with a minimal fix, namely to use FT_SelectCharmap() only if encoding is detected as gdFTEX_Unicode.

The biggest hurdle I'm currently facing is finding a font which we can use for the test case. Adam's test uses code2001.ttf which is apparently freeware, but it's lacking a license file.

@vapier
Copy link
Member

vapier commented Aug 23, 2016

you could always take an existing free font and hack it up with something like FontForge or Birdfont. we don't need the glyphs to be "correct", just unique so we can compare them against a reference image.

@cmb69
Copy link
Contributor

cmb69 commented Sep 9, 2016

Thanks, Mike! I'll give FontForge a try.

@0biWanKenobi
Copy link

Hello, this seems to be still unfixed? Any ETA? Thanks

@cmb69
Copy link
Contributor

cmb69 commented Mar 6, 2018

Any ETA?

No, sorry. A PR including a regression test would be very welcome!

@0biWanKenobi
Copy link

0biWanKenobi commented Mar 8, 2018

A PR including a regression test would be very welcome!

I'd love to help! Is there a way we can discuss what needs to be done, without polluting the thread? :)
A basic patch, as proposed here and mentioned above, would be to just extend the number of chars that the function is parsing.. Is that not enough? If there's more to be done, I can try helping, I just need some direction :)

@pentium10
Copy link

any update of this in 2019?

@sterlingpickens
Copy link

I've modified libgd and got it printing the entire 4 byte range now.
I'm still testing and am unsure about the html5 entities.
Also, I noticed xshow don't seem to account for the horiBearingX, is this intentional ie: meant to return the glyph origin ?

@sterlingpickens
Copy link

Here is a patch to test.
Issue-185-01.diff

@sterlingpickens
Copy link

sterlingpickens commented Apr 20, 2021

I refined the patch and extended it to include all html5 entities.
http://sterlingdesktops.com/pub/test/Issue-185-02.diff
I just need to do a complete test to verify that everything works.
Also, we need to possibly use a different charmap when there is no glyph present. This patch it hardcoding to let freetype decide with: FT_Select_Charmap(face, FT_ENCODING_UNICODE);

FT_Set_Charmap(face, charmap); stops working when higher codepages are used here.

I'll submit a pull request after further testing and would appreciate some comments.

@sterlingpickens
Copy link

sterlingpickens commented Apr 20, 2021

http://sterlingdesktops.com/pub/test/Issue-185-05.diff
I verified that this one works with all the different html/utf input formats.

Edit latest:
http://sterlingdesktops.com/pub/test/Issue-185-06.diff

@sterlingpickens sterlingpickens linked a pull request Apr 23, 2021 that will close this issue
@vapier
Copy link
Member

vapier commented Apr 24, 2021

the HTML spec links to a JSON database:
https://html.spec.whatwg.org/entities.json

seems like that'd be a lot easier to parse than an ad-hoc HTML parser ? could even do it in Python pretty easily.

@sterlingpickens
Copy link

Who actually runs the generation script/program ?
I also worry that /usr/include/entities.h is not a good idea since libgd is the only thing using it(I could be wrong).
If this is the case, then we should compile a entities.h/.c only and not install it as a header.
The best solution might be to include a header that is already on most people's systems from a different project.
Also, that entities.json will more than double the array members because of case sensitivity and ";".

If we are going to be completely thorough then we should check that there are no conflicts with unicode as well. IE: & as the first byte matching some of these entities.

@vapier
Copy link
Member

vapier commented Apr 25, 2021

i agree we should stop installing entities.h. i'll do that for the next major release. this should only exist for our own gdft.c usage.

the generation script is for us devs. we're committing the header file to the tree so we can ship it in releases.

@sterlingpickens
Copy link

gd_Entity_To_Unicode should be moved to entities.c as well, but then that'd complicate the gen script we end up using.

@vapier
Copy link
Member

vapier commented Apr 25, 2021

wrt entities.json, we can easily dedupe based on the trailing ; being omitted. similarly, i think we can normalize the entity names to lowercase and call it a day.

@sterlingpickens
Copy link

sterlingpickens commented Apr 25, 2021

"Entity names (the things that follow ampersands) are case-senstive, but many browsers will accept many of them entirely in uppercase or entirely in lowercase; a few must be cased in particular ways."

I think you're right, we have to go with entities.json.

4 different ways of specifying the same thing in this one example:
Dot die DoubleDot uml

My original solution was just taking the first, I hadn't considered this.
I'll give it some thought and try to put something together tomorrow.

Also there can be two unicode codepoints per entity ie: NotLessLess ≪̸

@sterlingpickens
Copy link

gdImageStringFTEx will have to be modified to handle two codepoints returned.
struct entities_s must have two ints to hold all these.

This will take some time. I will try not to flood this thread anymore untill I have something substantial.
I'm playing with json-c right now.

@vapier
Copy link
Member

vapier commented Apr 25, 2021

I would imagine python would be faster to do this in, so feel free to use that

@sterlingpickens
Copy link

I dunno about that, i'm already printing all the key/value pairs as strings in C.
It seems most the functions are just macros. The fastest might just be a bash script tbh.

I know perl better than I know python. perl regex would make quick work of it too.

@vapier
Copy link
Member

vapier commented Apr 25, 2021

we're in the process of removing perl from the codebase ;)

@sterlingpickens
Copy link

Ok, got a python generator working.
I need to modify gdft.c some more now

@sterlingpickens
Copy link

sterlingpickens commented Apr 27, 2021

Everything is in place at this point. It just needs testing.

Things i've tested:
All single codepage unicode given as utf8 (one per function call same image).
All single entity string/hex/dec (one per function call same image).
A string containing all utf8 encoded unicode for various fonts (single function call).
A string containing all entity strings (single function call).

I still need to test mixing of these as the function supports that and should autodetect.
Also I need to test that the JISX0208 or any non-unicode has not been broken.

#695

@pierrejoye
Copy link
Contributor

I think we are good with the PR. We need tests, but I can give @sterlingpickens a hand for this. @sterlingpickens did an amazing job here and libgd will be better with this addition.

@vapier @cmb69 merge and I can help then, thought?

@cmb69
Copy link
Contributor

cmb69 commented Dec 15, 2021

Yeah, I suggest to merge PR #695, and to go from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants