Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add char8_t mode #529

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Add char8_t mode #529

wants to merge 1 commit into from

Conversation

gix
Copy link

@gix gix commented Nov 2, 2022

Overview

This PR adds a new mode to pugixml which uses C++20 char8_t instead of char for the UTF-8 interface. I've dubbed it char8 mode and it can be enabled using the PUGIXML_CHAR8_MODE option/macro.

Motivation

Using pugixml in a mixed ANSI/Unicode Windows codebase is quite errorprone. char is already used for the system codepage and it can easily happen that a UTF-8-as-char string slips through without conversion. I've found char8_t it to be a huge help in migrating stuff to UTF-8.

Implementation

Not much has to be changed. Since pugixml already calls C string functions with UTF-8, we can do the same with char8_t by just casting in a few places.

Stream-based methods received an additional overload, since the char overload may be used to represent arbitrary bytes, and the char8_t overload may be used by string streams.

An additional typedef u8char_t, which represents the type pugixml uses for a UTF-8 code unit, was added for the conversion functions.

Most changes had to be done in the test code. Representing raw bytes as string literals does not work for UTF-8 literals, since hex escape codes are interpreted as a Unicode character (and replaced by multiple bytes). Affected places either received a branch with a u8 literal or use a new RAW() macro which smuggles in UTF-8 code points using chars.

Notes

  • As implemented, the PUGIXML_CHAR8_MODE modifies the UTF-8 interface, which means PUGIXML_WCHAR_MODE takes precedence if both are defined (no diagnostic).
  • I haven't updated the CI build descriptions since I don't know the compiler versions. Locally I've tested with Visual Studio 2022 17.3 and clang 15.
  • Maybe an overload taking char8_t file paths should be added as well?

char8 mode, enabled using the PUGIXML_CHAR8_MODE macro, uses C++20 char8_t
instead of char for the UTF-8 interface. This makes use of pugixml safer
when char is otherwise used for the system codepage.

Stream-based methods received an additional overload, since the char
overload may be used to represent arbitrary bytes, and the char8_t overload
may be used by string streams.

An additional typedef u8char_t, which represents the type pugixml uses for
a UTF-8 code unit, was added for the conversion functions.

Most changes had to be done in the test code. Representing raw bytes as
string literals does not work for UTF-8 literals, since hex escape codes
are interpreted as a Unicode character. Affected places either received a
branch with a u8 literal or use a new RAW() macro which smuggles in UTF-8
code points using chars.
@brandl-muc
Copy link
Contributor

Just a question on the last note. (The rest sound good to me, thanks)
Are you really using or planning to use char8_t for paths? Especially on Windows where UTF-16 is used? There are no overloads for fopen() or std::fstream::open() AFAIK. Would not std::filesystem::path be the better type?

@@ -217,6 +217,8 @@ PUGI__NS_BEGIN

#ifdef PUGIXML_WCHAR_MODE
return wcslen(s);
#elif defined(PUGIXML_CHAR8_MODE)
return strlen(reinterpret_cast<const char*>(s));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is strlength not implemented in terms of std::char_traits<pugi::char_t>::length, like you did for the tests?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char_traits are only available when compiling with STL-support. The tests don't have that restriction.

@gix
Copy link
Author

gix commented Nov 16, 2023

Is there any chance of this PR ever progressing?

@zeux
Copy link
Owner

zeux commented Nov 16, 2023

While the changes in the PR itself feel reasonable, I'd like to better understand how widely this will be used. This adds a new compilation mode; for thorough testing it would require increasing number of tested build configurations by 50%. In some sense it's fine, in some other sense it's a notable increase in complexity. It's a C++20 feature so it's not obvious how widely available it would be - I would be more comfortable with some indication from multiple people about this being very valuable in their work.

I think something like this could be merged if there's sufficient demand for a new mode. If there's any way to reduce the differences in tests that would also be nice - this doesn't matter that much for the PR itself, but when adding new tests the wchar_t mode is already a problem because people tend to forget STR macros, so if any of the new macros could be replaced somehow that would be nice. It's not a big deal if they are necessary.

# endif
inline const char8_t* char_cast(const char* bytes)
{
ALIASING_BARRIER(bytes);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this addressing a specific problem? It's really hard to tell without any comments.

@zeux
Copy link
Owner

zeux commented Nov 16, 2023

I guess in a sense C++20 is a mixed blessing - the GHA builds will get noticeably longer but AppVeyor builds are much longer and are dominated by all old VS versions that won't support this anyway.

Probably the main thing that would make me comfortable with merging this would be more indication that this is going to be helpful for various users.

@zeux
Copy link
Owner

zeux commented Nov 16, 2023

A little related although that one needs a separate fix to also avoid the footgun in regular mode: #378

@gix
Copy link
Author

gix commented Nov 22, 2023

Probably the main thing that would make me comfortable with merging this would be more indication that this is going to be helpful for various users.

For one, it would make pugixml future proof and help adoption of char8_t. Many complaints about char8_t are about how useless it is since nothing supports it; a classic chicken-and-egg problem. User's on Linux have long misused char and treated it as always being UTF-8 but it's not portable (and might not even work on all Linux systems).

On Windows the situation is worse since char is almost never UTF-8 and just compiling with UNICODE does not make all the legacy APIs go away. The options are:

  • wchar_t-mode: Increased memory usage. Many unnecessary conversions. UTF-16 is not necessarily a good internal encoding even on Windows.
  • UTF-8 as char: Error-prone and unportable.
  • Convert to char8_t: Cumbersome since every use of pugixml looks like PugiToU8(node.attribute(U8ToPugi(name)).value()).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants