[draft] Support XML 1.0r5 and above #711

Draft: wants to merge 2 commits into base: master

mazer1310

Proposed Fix for Issue #171
[XML 1.0r5/1.1] XML Parser does not respect valid XML Start and Name chars from XML specification 1.0r5 and after

In nametab.h, this replaces the previous opaque lookup tables with simplified nameChar and startNameChar maps that correspond to the XML specification at https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-syn
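
For readers without the spec open, the NameStartChar and NameChar productions it defines can be sketched as range tables. This is only an illustration of the grammar, not the code in this pull request: the names `cp_range`, `is_name_start_char`, and `is_name_char` are hypothetical and do not appear in nametab.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of Unicode code points (hypothetical helper type). */
typedef struct {
  uint32_t lo;
  uint32_t hi;
} cp_range;

/* NameStartChar production, XML 1.0 Fifth Edition and XML 1.1. */
static const cp_range name_start_ranges[] = {
    {':', ':'},       {'A', 'Z'},       {'_', '_'},
    {'a', 'z'},       {0xC0, 0xD6},     {0xD8, 0xF6},
    {0xF8, 0x2FF},    {0x370, 0x37D},   {0x37F, 0x1FFF},
    {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
    {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD},
    {0x10000, 0xEFFFF},
};

/* Characters that NameChar allows in addition to NameStartChar. */
static const cp_range name_extra_ranges[] = {
    {'-', '-'},   {'.', '.'},     {'0', '9'},
    {0xB7, 0xB7}, {0x300, 0x36F}, {0x203F, 0x2040},
};

/* Linear scan; the tables are small enough that this is adequate here. */
static bool in_ranges(uint32_t cp, const cp_range *r, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (cp >= r[i].lo && cp <= r[i].hi)
      return true;
  }
  return false;
}

/* True if cp may begin an XML name. */
bool is_name_start_char(uint32_t cp) {
  return in_ranges(cp, name_start_ranges,
                   sizeof name_start_ranges / sizeof name_start_ranges[0]);
}

/* True if cp may appear after the first character of an XML name. */
bool is_name_char(uint32_t cp) {
  return is_name_start_char(cp) ||
         in_ranges(cp, name_extra_ranges,
                   sizeof name_extra_ranges / sizeof name_extra_ranges[0]);
}
```

With these tables, U+4E2D (a CJK ideograph) and U+20000 are accepted as name start characters, while U+00D7 (the multiplication sign) is rejected.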

xmltok.c is updated to take advantage of the change. Additionally, because code points above 0xFFFF are now allowed in names, 4-byte UTF-8 characters are now set to isAlways rather than isNever.
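
To make the 4-byte case concrete: a 4-byte UTF-8 sequence always decodes to a code point at or above 0x10000, and the spec's name range for that region is [#x10000-#xEFFFF]. The sketch below decodes one such sequence and applies that range check. It is a stricter illustration than a blanket isAlways, and the function names `decode_utf8_4` and `is_supplementary_name_char` are hypothetical, not part of xmltok.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode one 4-byte UTF-8 sequence starting at p.
 * Returns false if the bytes are malformed, overlong, or above U+10FFFF. */
bool decode_utf8_4(const unsigned char *p, uint32_t *out) {
  /* Lead byte must look like 11110xxx. */
  if ((p[0] & 0xF8) != 0xF0)
    return false;
  /* Each continuation byte must look like 10xxxxxx. */
  for (int i = 1; i < 4; i++) {
    if ((p[i] & 0xC0) != 0x80)
      return false;
  }
  uint32_t cp = ((uint32_t)(p[0] & 0x07) << 18) |
                ((uint32_t)(p[1] & 0x3F) << 12) |
                ((uint32_t)(p[2] & 0x3F) << 6) |
                (uint32_t)(p[3] & 0x3F);
  /* Below 0x10000 is an overlong encoding; above 0x10FFFF is not Unicode. */
  if (cp < 0x10000 || cp > 0x10FFFF)
    return false;
  *out = cp;
  return true;
}

/* Name characters beyond the BMP, per the NameStartChar production. */
bool is_supplementary_name_char(uint32_t cp) {
  return cp >= 0x10000 && cp <= 0xEFFFF;
}
```

For example, the bytes F0 A0 80 80 decode to U+20000, which falls inside the name range, whereas F3 B0 80 80 decodes to U+F0000, which is a valid code point but outside it.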

In runtests.c, test_utf8_in_start_tags and test_long_utf8_character were updated to test the new character ranges.

@hildjj

hildjj commented May 9, 2023

This now lets expat pass all of the xmlconf conformance tests for "XML1.0-errata4e" except for rmt-e2e-50, which requires a v1.1 parser, oddly.

@mazer1310
Author

> except for rmt-e2e-50, which requires a v1.1 parser, oddly.

Looking at https://www.w3.org/XML/Test/xmlconf-20080827.html, that particular test looks like it is for normalization of line endings, so not surprising that this change would not address that since this was focused on startName and Name chars.

The XML specs for end-of-line handling:
1.1:
https://www.w3.org/TR/xml11/#sec-line-ends
1.0:
https://www.w3.org/TR/xml/#sec-line-ends

1.0 specifies that the sequence #xD#xA and any #xD not followed by #xA are converted to #xA.
1.1 specifies that the sequences #xD#xA and #xD#x85, the characters #x85 and #x2028, and any #xD not followed by #xA or #x85 are converted to #xA.
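
The difference between the two rule sets can be sketched as a single pass over decoded code points. This is an illustration of the spec text only, not anything expat does today: `normalize_line_ends` and its `xml11` flag are hypothetical names, and the caller is assumed to supply an output buffer at least as long as the input.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Normalize line ends in a buffer of code points.
 * xml11 == false applies the XML 1.0 rules, true applies the XML 1.1 rules.
 * out must have room for n code points. Returns the number written. */
size_t normalize_line_ends(const uint32_t *in, size_t n, uint32_t *out,
                           bool xml11) {
  size_t w = 0;
  for (size_t i = 0; i < n; i++) {
    uint32_t c = in[i];
    if (c == 0x0D) {
      /* Swallow the second character of a two-character line end:
       * CR LF in both versions, CR NEL only in XML 1.1. */
      if (i + 1 < n &&
          (in[i + 1] == 0x0A || (xml11 && in[i + 1] == 0x85))) {
        i++;
      }
      out[w++] = 0x0A;
    } else if (xml11 && (c == 0x85 || c == 0x2028)) {
      /* NEL and LINE SEPARATOR are line ends only in XML 1.1. */
      out[w++] = 0x0A;
    } else {
      out[w++] = c;
    }
  }
  return w;
}
```

Under the 1.0 rules a lone #x85 or #x2028 passes through unchanged, which is why a test that depends on those being normalized needs a 1.1 parser.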

@hartwork
Member

hartwork commented May 9, 2023

@mazer1310 @hildjj what exactly is going on here? #171 (comment) clearly stated that work in this area is not considered relevant (and hence does not deserve anyone's time), recent #171 (comment) made clear that that's still status quo and rather than arguing or explaining why that would have changed over the course of the last 20 years, you start a pull request about it. I'm not sure what exactly to make of that. And one of the commits here says I would be the author which I am not. What is going on here?

@mazer1310
Author

> @mazer1310 @hildjj what exactly is going on here? #171 (comment) clearly stated that work in this area is not considered relevant (and hence does not deserve anyone's time), recent #171 (comment) made clear that that's still status quo and rather than arguing or explaining why that would have changed over the course of the last 20 years, you start a pull request about it. I'm not sure what exactly to make of that. And one of the commits here says I would be the author which I am not. What is going on here?

As for the commit that says you are the author, I have no idea whatsoever what happened there. Git did something strange because I had two commits that should have both come from me, but clearly something went wonky with the check-in. I forked (fork = mazer1310/libexpat), created a branch on my fork (support-xml-1.0r5-and-above), submitted the changes on that branch, pushed to GitHub, then opened the pull request. I have no idea how your name got associated with any of those actions. It should have been tied solely to my GitHub user ID. I re-edited my submission comment to bring it in line with libexpat historical comments and perhaps that caused something screwy to occur? I'll look to see how to correct that in Git.

As for what is going on here, I've been personally patching libexpat within my organization's product since 2017 in order to address the problems we encountered reading XML documents, especially in Asia-Pacific. The code changes I submitted have been in production for us for several years successfully and I had not been good about sharing those changes with others that might benefit from it. Therefore, I figured I would finally get off my butt and share that with anyone else who relies on libexpat as we do. In my experience, libexpat is one of the best performing and adaptive parsers available, and I would like to see it continue to be excellent.

I was just trying to be helpful.

@hartwork
Member

@mazer1310 I appreciate your reply. Thank you.

@hartwork changed the title from "Support xml 1.0r5 and above" to "[draft] Support XML 1.0r5 and above" on May 11, 2023
@hartwork marked this pull request as draft on December 13, 2023 01:19