Add char8_t mode #529

gix · 2022-11-02T07:51:21Z

Overview

This PR adds a new mode to pugixml which uses C++20 char8_t instead of char for the UTF-8 interface. I've dubbed it char8 mode and it can be enabled using the PUGIXML_CHAR8_MODE option/macro.

Motivation

Using pugixml in a mixed ANSI/Unicode Windows codebase is quite errorprone. char is already used for the system codepage and it can easily happen that a UTF-8-as-char string slips through without conversion. I've found char8_t it to be a huge help in migrating stuff to UTF-8.

Implementation

Not much has to be changed. Since pugixml already calls C string functions with UTF-8, we can do the same with char8_t by just casting in a few places.

Stream-based methods received an additional overload, since the char overload may be used to represent arbitrary bytes, and the char8_t overload may be used by string streams.

An additional typedef u8char_t, which represents the type pugixml uses for a UTF-8 code unit, was added for the conversion functions.

Most changes had to be done in the test code. Representing raw bytes as string literals does not work for UTF-8 literals, since hex escape codes are interpreted as a Unicode character (and replaced by multiple bytes). Affected places either received a branch with a u8 literal or use a new RAW() macro which smuggles in UTF-8 code points using chars.

Notes

As implemented, the PUGIXML_CHAR8_MODE modifies the UTF-8 interface, which means PUGIXML_WCHAR_MODE takes precedence if both are defined (no diagnostic).
I haven't updated the CI build descriptions since I don't know the compiler versions. Locally I've tested with Visual Studio 2022 17.3 and clang 15.
Maybe an overload taking char8_t file paths should be added as well?

char8 mode, enabled using the PUGIXML_CHAR8_MODE macro, uses C++20 char8_t instead of char for the UTF-8 interface. This makes use of pugixml safer when char is otherwise used for the system codepage. Stream-based methods received an additional overload, since the char overload may be used to represent arbitrary bytes, and the char8_t overload may be used by string streams. An additional typedef u8char_t, which represents the type pugixml uses for a UTF-8 code unit, was added for the conversion functions. Most changes had to be done in the test code. Representing raw bytes as string literals does not work for UTF-8 literals, since hex escape codes are interpreted as a Unicode character. Affected places either received a branch with a u8 literal or use a new RAW() macro which smuggles in UTF-8 code points using chars.

brandl-muc · 2022-11-02T08:45:56Z

Just a question on the last note. (The rest sound good to me, thanks)
Are you really using or planning to use char8_t for paths? Especially on Windows where UTF-16 is used? There are no overloads for fopen() or std::fstream::open() AFAIK. Would not std::filesystem::path be the better type?

Tradias · 2022-11-02T16:46:25Z

src/pugixml.cpp

@@ -217,6 +217,8 @@ PUGI__NS_BEGIN

 	#ifdef PUGIXML_WCHAR_MODE
 		return wcslen(s);
+	#elif defined(PUGIXML_CHAR8_MODE)
+		return strlen(reinterpret_cast<const char*>(s));


Why is strlength not implemented in terms of std::char_traits<pugi::char_t>::length, like you did for the tests?

char_traits are only available when compiling with STL-support. The tests don't have that restriction.

gix · 2023-11-16T10:10:20Z

Is there any chance of this PR ever progressing?

zeux · 2023-11-16T16:17:41Z

While the changes in the PR itself feel reasonable, I'd like to better understand how widely this will be used. This adds a new compilation mode; for thorough testing it would require increasing number of tested build configurations by 50%. In some sense it's fine, in some other sense it's a notable increase in complexity. It's a C++20 feature so it's not obvious how widely available it would be - I would be more comfortable with some indication from multiple people about this being very valuable in their work.

I think something like this could be merged if there's sufficient demand for a new mode. If there's any way to reduce the differences in tests that would also be nice - this doesn't matter that much for the PR itself, but when adding new tests the wchar_t mode is already a problem because people tend to forget STR macros, so if any of the new macros could be replaced somehow that would be nice. It's not a big deal if they are necessary.

zeux · 2023-11-16T16:20:37Z

tests/test.hpp

+#	endif
+inline const char8_t* char_cast(const char* bytes)
+{
+	ALIASING_BARRIER(bytes);


Is this addressing a specific problem? It's really hard to tell without any comments.

zeux · 2023-11-16T16:42:00Z

I guess in a sense C++20 is a mixed blessing - the GHA builds will get noticeably longer but AppVeyor builds are much longer and are dominated by all old VS versions that won't support this anyway.

Probably the main thing that would make me comfortable with merging this would be more indication that this is going to be helpful for various users.

zeux · 2023-11-16T16:53:42Z

A little related although that one needs a separate fix to also avoid the footgun in regular mode: #378

gix · 2023-11-22T08:25:00Z

Probably the main thing that would make me comfortable with merging this would be more indication that this is going to be helpful for various users.

For one, it would make pugixml future proof and help adoption of char8_t. Many complaints about char8_t are about how useless it is since nothing supports it; a classic chicken-and-egg problem. User's on Linux have long misused char and treated it as always being UTF-8 but it's not portable (and might not even work on all Linux systems).

On Windows the situation is worse since char is almost never UTF-8 and just compiling with UNICODE does not make all the legacy APIs go away. The options are:

wchar_t-mode: Increased memory usage. Many unnecessary conversions. UTF-16 is not necessarily a good internal encoding even on Windows.
UTF-8 as char: Error-prone and unportable.
Convert to char8_t: Cumbersome since every use of pugixml looks like PugiToU8(node.attribute(U8ToPugi(name)).value()).

Tradias reviewed Nov 2, 2022

View reviewed changes

zeux reviewed Nov 16, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add char8_t mode #529

Add char8_t mode #529

gix commented Nov 2, 2022

brandl-muc commented Nov 2, 2022

Tradias Nov 2, 2022

gix Nov 5, 2022

gix commented Nov 16, 2023

zeux commented Nov 16, 2023

zeux Nov 16, 2023

zeux commented Nov 16, 2023 •

edited

zeux commented Nov 16, 2023

gix commented Nov 22, 2023

Add char8_t mode #529

Are you sure you want to change the base?

Add char8_t mode #529

Conversation

gix commented Nov 2, 2022

Overview

Motivation

Implementation

Notes

brandl-muc commented Nov 2, 2022

Tradias Nov 2, 2022

Choose a reason for hiding this comment

gix Nov 5, 2022

Choose a reason for hiding this comment

gix commented Nov 16, 2023

zeux commented Nov 16, 2023

zeux Nov 16, 2023

Choose a reason for hiding this comment

zeux commented Nov 16, 2023 • edited

zeux commented Nov 16, 2023

gix commented Nov 22, 2023

zeux commented Nov 16, 2023 •

edited