Add support for 4 byte UTF-8 characters and HTML entities over  and 𘚟 #185
Comments

can you attach a small file that illustrates the issue for us ? |
4 byte UTF-8 characters issue
Below you can see glyphs of the Arial Unicode font. And here you can see how GD draws consecutive glyphs starting from You can clearly see that characters after 0xFFFF are incorrectly parsed. Check the UTF-8 byte representation of the 0xFFFF and 0x10000 Unicode characters here: |
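To make the 0xFFFF / 0x10000 boundary concrete, here is a minimal sketch of a UTF-8 encoder. The function name `utf8_encode` is made up for illustration and is not part of libgd. It shows that U+FFFF is the last code point that fits in 3 bytes (EF BF BF) and that U+10000 is the first one that needs 4 bytes (F0 90 80 80).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libgd function): encode one Unicode code
 * point as UTF-8. Writes up to 4 bytes into out and returns the byte
 * count, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: up to U+FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: U+10000 and above */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0xFFFF, buf)` returns 3 and `utf8_encode(0x10000, buf)` returns 4, so any parser that stops at 3-byte sequences breaks exactly at this boundary.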
i'm not saying you're wrong ... i'm asking for a small program that we can build/run to check the behavior on our side so that we can develop a fix & testcase. that way we don't spend time recreating something you/someone else has already written, and we don't end up making changes for a different codepath w/out actually fixing the ones you're reporting. |
Adam Harvey already supplied a reproduce script, including a font file and the actual and expected images, in the corresponding PHP bug report, along with the first half of the necessary bugfixes. I have now analyzed the issue there further. TL;DR: we also have to select the appropriate charmap (a font can have several of them). |
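As a rough illustration of what "selecting the appropriate charmap" could mean, here is a self-contained sketch. It does not use FreeType; the struct `charmap_info` and the function `pick_unicode_charmap` are hypothetical stand-ins that only mirror the `platform_id` / `encoding_id` pair FreeType exposes per charmap. The assumption is that a full-repertoire Unicode cmap (platform 3 / encoding 10, or platform 0 / encoding 4 or 6) should win over a BMP-only one (platform 3 / encoding 1, or platform 0 / encodings 0 to 3). Whether the actual fix uses this ranking is not stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the platform/encoding pair of a font charmap. */
struct charmap_info {
    unsigned platform_id;
    unsigned encoding_id;
};

/* Score a charmap: 2 = full Unicode, 1 = BMP-only Unicode, 0 = other. */
static int charmap_score(const struct charmap_info *cm)
{
    if (cm->platform_id == 3 && cm->encoding_id == 10)
        return 2;                              /* Microsoft, UCS-4 */
    if (cm->platform_id == 0 &&
        (cm->encoding_id == 4 || cm->encoding_id == 6))
        return 2;                              /* Unicode, full repertoire */
    if (cm->platform_id == 3 && cm->encoding_id == 1)
        return 1;                              /* Microsoft, Unicode BMP */
    if (cm->platform_id == 0 && cm->encoding_id <= 3)
        return 1;                              /* Unicode, BMP only */
    return 0;
}

/* Return the index of the best Unicode charmap, or -1 if none qualifies. */
int pick_unicode_charmap(const struct charmap_info *maps, size_t count)
{
    assert(maps != NULL || count == 0);
    int best = -1;
    int best_score = 0;
    for (size_t i = 0; i < count; i++) {
        int score = charmap_score(&maps[i]);
        if (score > best_score) {
            best_score = score;
            best = (int)i;
        }
    }
    return best;
}
```

With a font that carries both a BMP cmap and a UCS-4 cmap, this picks the UCS-4 one, which is the table that can map code points above U+FFFF to glyphs.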
Specifically for emoji, we need to implement the ft flag support. As far as I remember we have another issue for that. |
@pierrejoye Indeed there is issue #184, which would be a nice enhancement. Wrt. this issue, it seems that in the long run we can simplify finding the desired charmap by just using The biggest hurdle I'm currently facing is finding a font which we can use for the test case. Adam's test uses code2001.ttf which is apparently freeware, but it's lacking a license file. |
Thanks, Mike! I'll give FontForge a try. |
Hello, this seems to be still unfixed? Any ETA? Thanks |
No, sorry. A PR including a regression test would be very welcome! |
I'd love to help! Is there a way we can discuss what needs to be done, without polluting the thread? :) |
any update of this in 2019? |
I've modified libgd and got it printing the entire 4 byte range now. |
Here is a patch to test. |
I refined the patch and extended it to include all html5 entities. FT_Set_Charmap(face, charmap); stops working when higher codepages are used here. I'll submit a pull request after further testing and would appreciate some comments. |
http://sterlingdesktops.com/pub/test/Issue-185-05.diff Edit latest: |
the HTML spec links to a JSON database: seems like that'd be a lot easier to parse than an ad-hoc HTML parser ? could even do it in Python pretty easily. |
Who actually runs the generation script/program ? If we are going to be completely thorough then we should check that there are no conflicts with unicode as well. IE: & as the first byte matching some of these entities. |
i agree we should stop installing entities.h. i'll do that for the next major release. this should only exist for our own gdft.c usage. the generation script is for us devs. we're committing the header file to the tree so we can ship it in releases. |
gd_Entity_To_Unicode should be moved to entities.c as well, but then that'd complicate the gen script we end up using. |
wrt entities.json, we can easily dedupe based on the trailing ; being omitted. similarly, i think we can normalize the entity names to lowercase and call it a day. |
"Entity names (the things that follow ampersands) are case-sensitive, but many browsers will accept many of them entirely in uppercase or entirely in lowercase; a few must be cased in particular ways." I think you're right, we have to go with entities.json. 4 different ways of specifying the same thing in this one example: My original solution was just taking the first; I hadn't considered this. Also, there can be two Unicode code points per entity, e.g. NotLessLess ≪̸ |
gdImageStringFTEx will have to be modified to handle two code points being returned. This will take some time. I will try not to flood this thread anymore until I have something substantial. |
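A minimal sketch of what a lookup returning one or two code points might look like. The struct, the table, and `entity_lookup` are made up for illustration and are not the generated entities.h or `gd_Entity_To_Unicode`; the table only contains three sample rows. The comparison is deliberately case-sensitive, as the quote above requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table row: an entity name maps to one or two code points. */
struct entity {
    const char *name;     /* name without the leading '&' and trailing ';' */
    uint32_t    cp[2];    /* code points; cp[1] is unused when count == 1 */
    size_t      count;    /* 1 or 2 */
};

/* A few sample rows; a real table would be generated from entities.json. */
static const struct entity sample_entities[] = {
    { "amp",         { 0x0026, 0      }, 1 },
    { "lt",          { 0x003C, 0      }, 1 },
    { "NotLessLess", { 0x226A, 0x0338 }, 2 },
};

/* Case-sensitive lookup. Stores the code points in out and returns how
 * many were written (1 or 2), or 0 if the name is unknown. */
size_t entity_lookup(const char *name, uint32_t out[2])
{
    assert(name != NULL && out != NULL);
    size_t n = sizeof sample_entities / sizeof sample_entities[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(sample_entities[i].name, name) == 0) {
            out[0] = sample_entities[i].cp[0];
            out[1] = sample_entities[i].cp[1];
            return sample_entities[i].count;
        }
    }
    return 0;
}
```

A caller would then draw `count` glyphs for a single entity, which is the change the renderer needs in order to handle entries such as NotLessLess (U+226A followed by the combining U+0338).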
I would imagine python would be faster to do this in, so feel free to use that |
I dunno about that, i'm already printing all the key/value pairs as strings in C. I know perl better than I know python. perl regex would make quick work of it too. |
we're in the process of removing perl from the codebase ;) |
Ok, got a python generator working. |
Everything is in place at this point. It just needs testing. Things I've tested: I still need to test mixing of these, as the function supports that and should autodetect. |
I think we are good with the PR. We need tests, but I can give @sterlingpickens a hand for this. @sterlingpickens did an amazing job here and libgd will be better with this addition. |
Yeah, I suggest to merge PR #695, and to go from there. |
Currently, GD supports only UTF-8 characters of 1-3 bytes. Four byte characters are unsupported. HTML entities higher than

and𘚟
are also unsupported. This makes it impossible to draw emojis, which are Unicode characters at U+1F601 and higher. The UTF-8 representation of an emoji is at least 4 bytes long. I thought I could bypass this issue using HTML entities. However, it seems that they are bugged too.
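For reference, a numeric HTML entity for an emoji such as U+1F601 can be written as `&#128513;` or `&#x1F601;`. The sketch below shows one way to parse such a reference into a full 21-bit code point; `parse_numeric_entity` is a hypothetical name, not a libgd function, and the 32-bit result type is the point of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse "&#NNN;" or "&#xHHH;" at the start of s.
 * Stores the code point in *cp and returns the number of characters
 * consumed, or 0 if s is not a valid numeric reference up to U+10FFFF. */
size_t parse_numeric_entity(const char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] != '&' || s[1] != '#')
        return 0;

    size_t i = 2;
    uint32_t base = 10;
    if (s[i] == 'x' || s[i] == 'X') {
        base = 16;
        i++;
    }

    uint32_t value = 0;
    size_t digits = 0;
    for (;; i++, digits++) {
        unsigned char c = (unsigned char)s[i];
        uint32_t d;
        if (c >= '0' && c <= '9')
            d = (uint32_t)(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f')
            d = (uint32_t)(c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F')
            d = (uint32_t)(c - 'A') + 10;
        else
            break;
        value = value * base + d;
        if (value > 0x10FFFF)
            return 0;              /* beyond the Unicode range */
    }

    if (digits == 0 || s[i] != ';')
        return 0;
    *cp = value;                   /* 32-bit, so values above 0xFFFF survive */
    return i + 1;
}
```

Both `&#128513;` and `&#x1F601;` yield 0x1F601 here; storing the result in a 16-bit type instead would silently truncate it.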
The invalid code lies in the gdTcl_UtfToUniChar() function in the gdft.c file.
Sources:
gdft.c
gdft.c
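For comparison, this is a minimal sketch of a decoder that accepts 4-byte sequences. It is not the code of gdTcl_UtfToUniChar() and not the eventual patch; `utf8_decode` and its signature are assumptions made for the example. It rejects truncated input, bad continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (len bytes
 * available). Stores the code point in *cp and returns the number of
 * bytes consumed, or 0 if the sequence is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;       /* total sequence length */
    uint32_t value;    /* accumulated code point */
    uint32_t min;      /* smallest value allowed for this length */

    if (b0 < 0x80) {
        *cp = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        need = 2; value = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        need = 3; value = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        need = 4; value = b0 & 0x07; min = 0x10000;   /* the 4-byte case */
    } else {
        return 0;      /* stray continuation byte or invalid lead byte */
    }

    if (len < need)
        return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte */
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;      /* overlong, out of range, or surrogate */

    *cp = value;
    return need;
}
```

Feeding it the bytes F0 9F 98 81 yields U+1F601, the emoji mentioned above, while a decoder limited to 3-byte sequences has no branch for the F0 lead byte at all.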
Possibly related issue
Someone on the PHP bug tracker suspects that, due to an integer overflow, Unicode characters higher than 65536 might also be broken.