strn*/utf8n* functions #105

Open
crt333 opened this issue Aug 3, 2022 · 5 comments

Comments

@crt333

crt333 commented Aug 3, 2022

The utf8len function returns a length in codepoints rather than bytes, as expected, but functions like utf8ncmp still take n in bytes, which wasn't what I expected. Perhaps utf8ncmp could take n in codepoints too, and a separate utf8bcmp could take a count b in bytes?

Not at all critical, since a work-around is easy, and I have no idea if others would want codepoint counting instead of bytes for the n functions. I needed an n codepoint compare, so I noticed this.
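For reference, the kind of work-around mentioned above might look something like the sketch below. It does not use utf8.h at all: `cp_prefix_bytes` and `cp_ncmp` are hypothetical names introduced only for illustration, and the code assumes both inputs are valid, NUL-terminated UTF-8. The idea is to convert the codepoint count into a byte length first, then compare bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of utf8.h: number of bytes occupied by the
   first n codepoints of s (or by the whole string if it is shorter).
   Assumes s is valid, NUL-terminated UTF-8. */
static size_t cp_prefix_bytes(const char *s, size_t n) {
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;
    while (p[i] != '\0' && n > 0) {
        i++;                                /* consume the lead byte */
        while ((p[i] & 0xC0) == 0x80) i++;  /* consume continuation bytes */
        n--;
    }
    return i;
}

/* Hypothetical helper, NOT part of utf8.h: compare at most n codepoints.
   Byte order and codepoint order agree in UTF-8, so memcmp is sufficient. */
static int cp_ncmp(const char *a, const char *b, size_t n) {
    size_t la = cp_prefix_bytes(a, n);
    size_t lb = cp_prefix_bytes(b, n);
    size_t m = la < lb ? la : lb;
    int r = memcmp(a, b, m);
    if (r != 0) return r;
    /* Equal so far: the string that ended early sorts first. */
    return (la > lb) - (la < lb);
}
```

With this, `cp_ncmp("h\xC3\xA9llo", "h\xC3\xA9las", 3)` compares "hél" against "hél", whereas a byte-based n of 3 would stop after "hé".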

@sheredom
Owner

sheredom commented Aug 3, 2022

I think I'd rather add utf8ccmp (codepoint compare?) - since we've already got n as a denotation for bytes, as a relic from mimicking string.h. Thoughts?

@crt333
Author

crt333 commented Aug 3, 2022

Ahh, good point, what I suggested would break existing code that uses the header. Yes, I think your utf8ccmp is a good solution. I think many of the utf8n* functions could use a utf8c* version, though it may not make sense in all cases. Thanks for the speedy reply.

@xparq

xparq commented Apr 14, 2023

Do I see it correctly that even utf8ncpy works with bytes (code units), too, instead of codepoints? That'd be a showstopper for me, unfortunately.

> I think I'd rather add utf8ccmp (codepoint compare?) - since we've already got n as a denotation for bytes, as a relic from mimicking string.h. Thoughts?

FWIW, I'd vote for a breaking change, perhaps accompanied by a filename change (like utf8str.h or u8str.h etc.), and do away with the legacy byte semantics for n, in favor of codepoints by default. (Or always? Are there compelling use cases where you'd want to iterate a UTF-8 string by byte?)
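To illustrate why a byte-based n can be a problem when copying, here is a small sketch using only the standard library. `cut_splits_codepoint` is a hypothetical helper (not something utf8.h provides) that reports whether truncating a valid UTF-8 string after n bytes would land in the middle of a multi-byte sequence.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of utf8.h: returns 1 if cutting s after
   n bytes would split a multi-byte UTF-8 sequence, 0 otherwise.
   Assumes s is valid, NUL-terminated UTF-8. */
static int cut_splits_codepoint(const char *s, size_t n) {
    if (n >= strlen(s)) return 0;  /* nothing is cut off */
    /* A cut is unsafe when the first dropped byte is a continuation
       byte (bit pattern 10xxxxxx). */
    return ((unsigned char)s[n] & 0xC0) == 0x80;
}
```

For "h\xC3\xA9llo" ("héllo"), a cut after 2 bytes leaves a dangling lead byte 0xC3, which is what a byte-limited copy with n = 2 would produce.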

@Theldus

Theldus commented Aug 17, 2023

> Do I see it correctly that even utf8ncpy works with bytes (code units), too, instead of codepoints? That'd be a showstopper for me, unfortunately.

For me too.
I recently had to implement my own routine to copy up to N codepoints/characters, which worked fine (I think), but it would be really helpful if this already existed in a library such as utf8.h =).
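A routine of that kind could look roughly like the following. This is only a sketch under stated assumptions, not the routine referred to above and not a utf8.h function: `cp_ncpy` is a hypothetical name, the source is assumed to be valid NUL-terminated UTF-8, and the destination capacity is passed separately in bytes so the copy never overflows and never splits a codepoint.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of utf8.h: copy at most n codepoints from
   src into dst, which has room for dst_size bytes including the NUL.
   Only whole codepoints are copied. Returns the number of bytes written,
   not counting the terminating NUL. */
static size_t cp_ncpy(char *dst, size_t dst_size, const char *src, size_t n) {
    const unsigned char *p = (const unsigned char *)src;
    size_t out = 0;
    if (dst_size == 0) return 0;
    while (n > 0 && p[out] != '\0') {
        size_t len = 1;                               /* lead byte */
        while ((p[out + len] & 0xC0) == 0x80) len++;  /* continuation bytes */
        if (out + len >= dst_size) break;             /* keep room for NUL */
        out += len;
        n--;
    }
    memcpy(dst, src, out);
    dst[out] = '\0';
    return out;
}
```

Taking both a codepoint limit and a byte capacity keeps the buffer-safety role that n has in the strn* functions while letting the caller think in characters.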

> and do away with the legacy byte semantics for n, in favor of codepoints by default. (Or always? Are there compelling use cases when you'd want to iterate an UTF-8 string by byte?)

This is something I agree with as well.
I believe the API is currently mixed, as evidenced by the output of man strlen on my system:

RETURN VALUE
The strlen() function returns the number of bytes in the string pointed to by s.

whereas utf8len() returns the number of codepoints, not the number of bytes.

Since the utf8*() routines always deal with UTF-8, I believe the 'n' parameter should always refer to codepoints/characters rather than bytes. The standard libc routines already handle any byte-level operations.
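The difference between the two kinds of length can be shown with a short sketch. `cp_len` below is a hypothetical stand-in written for illustration (it is not utf8len itself) and assumes valid, NUL-terminated UTF-8: it counts every byte that is not a continuation byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT utf8.h's utf8len: count codepoints by counting
   the bytes that start a sequence, i.e. every byte that is not a
   continuation byte (bit pattern 10xxxxxx). */
static size_t cp_len(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) count++;
    }
    return count;
}
```

For "h\xC3\xA9llo" ("héllo"), strlen reports 6 bytes while `cp_len` reports 5 codepoints, so a value taken from one cannot safely be passed as the n of a function that expects the other.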

@sheredom
Owner

I think your arguments are probably right, the more I've considered this. My hesitation has been that the n is generally used in strn* functions to say 'hey, I only have this many bytes in this buffer!'. We can always add a b suffix like utf8b to mean bytes.

I can't commit to a timescale for this, but I'll try and work out a plan to do it.
