
add impl UnicodeSegmentation for [u8] #46

Open
BurntSushi opened this issue Oct 7, 2018 · 7 comments

@BurntSushi

Would the maintainers be willing to accept an implementation of the trait for [u8]? Specifically, for possibly invalid UTF-8. We could specify that if invalid UTF-8 bytes are found, then we lossily emit a Unicode replacement codepoint.

cc @SimonSapin

@SimonSapin
Contributor

In terms of implementation, this doesn't look easy, as the code seems to make pervasive use of &str.

In terms of API, I'm mildly opposed to silently doing replacement without users having to type the name of some item that contains the word lossy or similar. For example, this could mean implementing the trait for struct AsUtf8Lossy<'a>(&'a [u8]);.

But then we have trait methods that return iterators of &'a str or (usize, &'a str). The replacement character "\u{FFFD}" is three bytes long, but it would replace byte sequences of variable length. This might require allocating a new string, whose ownership is unclear to me right now. But if we're allocating, the caller might as well use String::from_utf8_lossy themselves and then use the str implementation.

So instead, maybe the iterator types could be made generic to support both str and [u8]? They wouldn't actually replace invalid byte sequences, but only behave as if those sequences were U+FFFD. This alleviates my initial API concern, but makes the implementation that much more involved.
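
A rough sketch of the shape that could take; every name below is invented for illustration and none of it is this crate's actual API:

```rust
// Hypothetical sketch of making the iterators generic over str and [u8];
// all names here are invented. Nothing in the input is ever rewritten:
// the [u8] impl simply reports U+FFFD to the segmentation rules whenever
// it hits an invalid sequence.
trait SegmentationInput {
    /// Decode the code point starting at `offset`, mapping any invalid
    /// UTF-8 sequence to U+FFFD, and return it with the bytes consumed.
    fn decode_at(&self, offset: usize) -> Option<(char, usize)>;
}

impl SegmentationInput for str {
    fn decode_at(&self, offset: usize) -> Option<(char, usize)> {
        let ch = self.get(offset..)?.chars().next()?;
        Some((ch, ch.len_utf8()))
    }
}

impl SegmentationInput for [u8] {
    fn decode_at(&self, offset: usize) -> Option<(char, usize)> {
        let rest = self.get(offset..).filter(|r| !r.is_empty())?;
        match std::str::from_utf8(rest) {
            Ok(s) => s.chars().next().map(|ch| (ch, ch.len_utf8())),
            Err(e) if e.valid_up_to() > 0 => {
                // A valid prefix exists: decode its first code point.
                let s = std::str::from_utf8(&rest[..e.valid_up_to()]).unwrap();
                let ch = s.chars().next().unwrap();
                Some((ch, ch.len_utf8()))
            }
            // The bytes at `offset` are invalid: pretend they are U+FFFD.
            Err(e) => Some(('\u{FFFD}', e.error_len().unwrap_or(rest.len()))),
        }
    }
}
```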


By the way, in case you're also tempted to do incremental processing where not all of the input bytes are in memory at the same time (or simply not in contiguous memory), note that segmentation does not "distribute" over concatenation: segment(a + b) does not always equal segment(a) + segment(b). That's true not only at the byte level, where a single code point's bytes could be split across multiple input chunks, but even if you keep code points together: a single grapheme cluster can be made of multiple code points from multiple chunks, for example.
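
For example, with this crate's existing str API:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: each chunk on its
    // own segments to one grapheme cluster...
    let a = "e";
    let b = "\u{301}";
    assert_eq!(a.graphemes(true).count() + b.graphemes(true).count(), 2);

    // ...but their concatenation is a single cluster: "é".
    let ab = format!("{}{}", a, b);
    assert_eq!(ab.graphemes(true).count(), 1);
}
```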

@BurntSushi
Author

BurntSushi commented Oct 8, 2018

> In terms of implementation, this doesn't look easy, as the code seems to make pervasive use of &str.

Yeah. It's been a while since I looked at the implementation in this crate, but I suspected as much. I think that's definitely part of my question here. Namely, if I were to implement this, and it resulted in a fairly pervasive change, is that something the maintainers would be willing to accept?

> In terms of API, I'm mildly opposed to silently doing replacement without users having to type the name of some item that contains the word lossy or similar. For example, this could mean implementing the trait for struct AsUtf8Lossy<'a>(&'a [u8]);.

Hmm. OK. I don't think I strongly disagree with this.

> But then we have trait methods that return iterators of &'a str or (usize, &'a str). The replacement character "\u{FFFD}" is three bytes long, but it would replace byte sequences of variable length. This might require allocating a new string, whose ownership is unclear to me right now. But if we're allocating, the caller might as well use String::from_utf8_lossy themselves and then use the str implementation.

Interesting. Hmm. OK. I think I'm still playing catch-up here, and I think there are parts that I don't quite grok yet. In particular, I have not yet deeply familiarized myself with the segmentation algorithm, so I might be missing some critical context here.

What I had in mind shouldn't require any allocations. In particular, if an invalid UTF-8 sequence is seen, then that sequence is skipped and "\u{FFFD}" is yielded for that iteration. I think I'm missing where you think the additional allocation comes in?
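
Concretely, here's a minimal sketch of what I mean (names invented, and independent of the segmentation algorithm itself): valid runs are borrowed from the input, and "\u{FFFD}" is a &'static str, so nothing ever allocates.

```rust
/// Sketch of the no-allocation scheme: valid runs are borrowed from the
/// input, and each invalid sequence is skipped and replaced by a
/// &'static "\u{FFFD}" in the output.
struct LossyChunks<'a> {
    bytes: &'a [u8],
}

impl<'a> Iterator for LossyChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.bytes.is_empty() {
            return None;
        }
        match std::str::from_utf8(self.bytes) {
            // Everything left is valid: yield it and finish.
            Ok(valid) => {
                self.bytes = &[];
                Some(valid)
            }
            Err(e) if e.valid_up_to() > 0 => {
                // Yield the valid prefix, borrowed from the input.
                let (valid, rest) = self.bytes.split_at(e.valid_up_to());
                self.bytes = rest;
                Some(std::str::from_utf8(valid).unwrap())
            }
            Err(e) => {
                // Skip the invalid sequence and yield a 'static U+FFFD;
                // no allocation anywhere.
                let skip = e.error_len().unwrap_or(self.bytes.len());
                self.bytes = &self.bytes[skip..];
                Some("\u{FFFD}")
            }
        }
    }
}
```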

> So instead, maybe the iterator types could be made generic to support both str and [u8]? They wouldn't actually replace invalid byte sequences, but only behave as if those sequences were U+FFFD. This alleviates my initial API concern, but makes the implementation that much more involved.

This is interesting! So in this sense, the API for [u8] wouldn't be lossy at all, and iterations with invalid sequences would yield the invalid bytes themselves. At that point, it would be the caller's responsibility to do a second UTF-8 validation step if they wanted an &str.

I'm not sure which API is better. It's conceivable both could be desirable.

> By the way, in case you're also tempted to do incremental processing where not all of the input bytes are in memory at the same time (or simply not in contiguous memory), note that segmentation does not "distribute" over concatenation: segment(a + b) does not always equal segment(a) + segment(b). That's true not only at the byte level, where a single code point's bytes could be split across multiple input chunks, but even if you keep code points together: a single grapheme cluster can be made of multiple code points from multiple chunks, for example.

Right, yeah, that makes sense. I imagine if I were doing this, then I'd need to buffer input until I knew I had a single complete grapheme.


To pop up a level and give you an idea of what I want to do: basically, in my work, I frequently deal with &[u8]. Ideally, that &[u8] is valid UTF-8, but of course, it may not be in some cases. The common scenario is to treat &[u8] as if it were valid UTF-8. This doesn't work with &str, obviously, so I need to stay in &[u8] land. I've been kind of skirting around the issue and making the regex engine deal with this for the most part, but difficulties pop up now and then.

For example, ripgrep wants to grow a feature that limits the length of a line that it prints. Right now, that's implemented by simply comparing the total number of bytes and, if it exceeds some limit, replacing the entire line with a message saying it was stripped. Ideally, we could instead show the first N characters of that line. The best way I know to implement that is to decode the first N grapheme clusters and print those.

However, it's not feasible to do that, AFAIK, with the current API of this crate. I could do a lossy conversion of the entire line, but the line could be quite large (potentially arbitrarily large), so it would be much nicer to just decode what I need and be done with it.
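
For reference, the truncation itself is only a few lines with the existing str API; the &[u8] equivalent is the missing piece. A sketch:

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Keep at most `n` grapheme clusters of `line`. This only works on
/// valid UTF-8; the &[u8] version is exactly what's missing.
fn truncate_graphemes(line: &str, n: usize) -> &str {
    match line.grapheme_indices(true).nth(n) {
        // `offset` is where the (n+1)-th cluster starts, i.e. where
        // the first `n` clusters end.
        Some((offset, _)) => &line[..offset],
        // `n` or fewer clusters: keep the whole line.
        None => line,
    }
}
```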

More generally, I've been toying with the idea of creating a new string type that is "conventionally UTF-8, but not required to be," similar to the semantics of string in Go. I don't know yet whether it's a niche type or not, but honestly, it is very frequently the actual thing I want in my work. In order to make a type like that work with Unicode, we basically need lossy (or similar) APIs for everything that works on &[u8], i.e., possibly invalid UTF-8. So this issue is me trying to feel out how amenable our Unicode ecosystem is to this, or whether I should just go off and prototype it on my own first.

I'd very much welcome any thoughts you have on my overall plan here. :-) Apologies if it's a little sidetracked from the specific issue, but I figured giving some context on what I'm trying to do would help.

@mbrubeck
Contributor

mbrubeck commented Oct 8, 2018

Speaking as a sometime-contributor to this crate, I wouldn't object to the proposed implementation/API changes.

But I'd also note that this crate is not very actively maintained, and the fork at https://crates.io/crates/unic-segment might be a better place for new development. It hasn't been very actively developed either, but the "rust-unic" project as a whole has been more active recently than the "unicode-rs" project.

@BurntSushi
Author

@mbrubeck Yeah, I noticed that. It's tough keeping up with maintenance on these types of crates; there's a high context-switch overhead to diving into them (in my experience, anyway).

Thanks for pointing out unic-segment. I hadn't seen that.

If I do decide to move forward with this, I might just go ahead and prototype this in my own space, with the intent of hopefully merging efforts at some point in the future.

@aochagavia

aochagavia commented Jan 3, 2023

Just out of interest, did @BurntSushi or anyone else end up prototyping / implementing this somewhere? I have come across a use case where this might be useful.

Update: it looks like @BurntSushi went on and created bstr, which does segmentation too 😄

@BurntSushi
Author

BurntSushi commented Jan 3, 2023

@aochagavia Yeah, indeed. I also used a different implementation technique than unicode-segmentation: regexes that generate offline DFAs, which get compiled into the library. For example: https://github.com/BurntSushi/bstr/blob/86947727666d7b21c97eb16145b3ad6ac22aacd3/scripts/regex/grapheme.sh

I generally think my technique was mostly a failure. It works, but not as well as I'd hoped. The actual binary size is still big and I'm not sure it's meaningfully faster. ICU4X is probably a good project to look into if you want the best, and I believe it supports [u8].

As for APIs, I just exposed, e.g., graphemes and grapheme_indices on [u8] directly. As I mentioned above, the Unicode replacement codepoint is automatically substituted when invalid UTF-8 is seen, but the grapheme_indices API provides offsets into the original slice that permit accessing the original invalid bytes when necessary.
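
Roughly, that looks like this (a quick sketch):

```rust
use bstr::ByteSlice;

fn main() {
    // One invalid byte sandwiched between two valid ones.
    let bytes: &[u8] = b"a\xFFz";

    // Invalid UTF-8 comes out as U+FFFD...
    let gs: Vec<&str> = bytes.graphemes().collect();
    assert_eq!(gs, vec!["a", "\u{FFFD}", "z"]);

    // ...while the (start, end) offsets still index into the original
    // slice, so the raw invalid bytes stay reachable.
    for (start, end, g) in bytes.grapheme_indices() {
        if g == "\u{FFFD}" {
            assert_eq!(&bytes[start..end], b"\xFF");
        }
    }
}
```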

@Manishearth
Member

Also worth noting that icu_segmenter supports a bunch of different encodings (including invalid ones), and supports rule-based and dictionary-based segmentation: https://docs.rs/icu_segmenter

OTOH, it is probably not as fast as hardcoded segmenters.
