Implement UnicodeSegmentation for Iterator<Item = char> #28

Open
cbarrick opened this issue Jun 18, 2017 · 6 comments

Comments

@cbarrick

It would be nice to segment character iterators, especially for interoperability with the unicode-normalization crate. This could provide a solution to #7 when/if io::Chars stabilizes. In particular, I'd like to write a tokenizer like this:

let input = my_input(); // some type implementing BufRead
let tokens = input.chars().nfkc().split_word_bounds();

One issue I see is that most of the public structs provide an as_str method that returns "the underlying data (the part yet to be iterated) as a slice of the original string". This obviously won't work with streaming types.

@ghost

ghost commented Jul 14, 2017

I will work on this.

@ghost

ghost commented Jul 14, 2017

@CryZe can you explain your CharOrBoundary idea?

@CryZe

CryZe commented Jul 14, 2017

The problem is that unicode-segmentation is built around borrowing str slices from some owned data, like a String. That doesn't work well with char iterators, as those aren't stored anywhere. So to support this at all, you would need to introduce a separate streaming API that provides Iterator<Item = CharOrBoundary> iterators, with CharOrBoundary being:

pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

So you can then iterate over the characters and it'll tell you whether you hit a boundary or not. You could then have other helper functions that help you collect char segments between the boundaries into buffers.
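
A minimal sketch of what such a streaming adapter and a collecting helper could look like. The names `Boundaries` and `collect_segments` are hypothetical, and `is_boundary_between` is a deliberately simplistic placeholder (a boundary wherever whitespace meets non-whitespace), not the UAX #29 rules the crate implements:

```rust
use std::iter::Peekable;

/// The marker enum from the comment above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

/// Placeholder rule (NOT UAX #29): a boundary lies between two chars
/// whenever one is whitespace and the other is not.
fn is_boundary_between(prev: char, next: char) -> bool {
    prev.is_whitespace() != next.is_whitespace()
}

/// Hypothetical adapter that interleaves `Boundary` markers into a char stream.
pub struct Boundaries<I: Iterator<Item = char>> {
    chars: Peekable<I>,
    /// Set when a boundary must be emitted before the next char.
    pending: bool,
}

impl<I: Iterator<Item = char>> Boundaries<I> {
    pub fn new(chars: I) -> Self {
        Boundaries { chars: chars.peekable(), pending: false }
    }
}

impl<I: Iterator<Item = char>> Iterator for Boundaries<I> {
    type Item = CharOrBoundary;

    fn next(&mut self) -> Option<CharOrBoundary> {
        if self.pending {
            self.pending = false;
            return Some(CharOrBoundary::Boundary);
        }
        let c = self.chars.next()?;
        // One char of lookahead decides whether a boundary follows `c`.
        // End of input also counts as a boundary.
        self.pending = match self.chars.peek() {
            Some(&next) => is_boundary_between(c, next),
            None => true,
        };
        Some(CharOrBoundary::Char(c))
    }
}

/// Helper that collects the chars between boundaries into owned buffers.
pub fn collect_segments<I>(items: I) -> Vec<String>
where
    I: Iterator<Item = CharOrBoundary>,
{
    let mut segments = Vec::new();
    let mut buf = String::new();
    for item in items {
        match item {
            CharOrBoundary::Char(c) => buf.push(c),
            CharOrBoundary::Boundary => {
                if !buf.is_empty() {
                    segments.push(std::mem::take(&mut buf));
                }
            }
        }
    }
    if !buf.is_empty() {
        segments.push(buf);
    }
    segments
}
```

Under these assumptions, `collect_segments(Boundaries::new("ab cd".chars()))` would yield the three buffers "ab", " " and "cd". A real implementation would need more lookahead than one char for some word-boundary rules.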

@HadrienG2

+1 to this idea, would also be useful when working with non-UTF8 strings in legacy APIs.

@HadrienG2

To clarify, I think that providing the full UnicodeSegmentation on top of Iterator<Item=char> is hopeless because the existing UnicodeSegmentation API heavily assumes access to an underlying &str all over the place, and in the case of an Iterator<Item=char> there may not be one.

What we could provide, however, is something that turns an Iterator<Item=char> into an Iterator<Item=Iterator<Item=char>> of sorts that represents graphemes or words (may need to be a streaming iterator if we don't want to impose a Clone bound on the underlying Iterator, or if we want to avoid parsing the text twice).

Since GraphemeCursor::next_boundary() already works on top of a char iterator, it might be possible to rewrite it in terms of this API in order to avoid code duplication. For words, it's less clear how to proceed, as the implementation makes even more UTF-8 string assumptions, such as manipulating string indices under the hood.
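
To make the "iterator of iterators" shape concrete, here is a rough sketch of a streaming (lending) variant that needs no Clone bound and reads the input once. `Segments`, `Segment` and `next_segment` are hypothetical names, and the boundary rule is again a whitespace-based placeholder rather than real grapheme or word rules:

```rust
use std::iter::Peekable;

/// Placeholder rule (NOT UAX #29): boundary where whitespace-ness changes.
fn is_boundary_between(prev: char, next: char) -> bool {
    prev.is_whitespace() != next.is_whitespace()
}

/// Hypothetical streaming segmenter over any char iterator.
pub struct Segments<I: Iterator<Item = char>> {
    chars: Peekable<I>,
    /// Last char yielded in the current segment, if any.
    prev: Option<char>,
    /// Whether a segment has been handed out before.
    started: bool,
}

/// One segment; it borrows the parent, so only one can be alive at a time.
pub struct Segment<'a, I: Iterator<Item = char>> {
    parent: &'a mut Segments<I>,
}

impl<I: Iterator<Item = char>> Segments<I> {
    pub fn new(chars: I) -> Self {
        Segments { chars: chars.peekable(), prev: None, started: false }
    }

    /// Not `Iterator::next`: the returned segment borrows `self`,
    /// which the standard `Iterator` trait cannot express.
    pub fn next_segment(&mut self) -> Option<Segment<'_, I>> {
        if self.started {
            // Skip whatever the caller left unread in the previous segment.
            while self.next_in_segment().is_some() {}
        }
        self.started = true;
        self.prev = None;
        self.chars.peek()?; // no more input means no more segments
        Some(Segment { parent: self })
    }

    /// Yields the next char of the current segment, or `None` at a boundary.
    fn next_in_segment(&mut self) -> Option<char> {
        let next = *self.chars.peek()?;
        if let Some(prev) = self.prev {
            if is_boundary_between(prev, next) {
                return None;
            }
        }
        self.prev = Some(next);
        self.chars.next()
    }
}

impl<'a, I: Iterator<Item = char>> Iterator for Segment<'a, I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        self.parent.next_in_segment()
    }
}
```

Usage would be a `while let Some(seg) = segs.next_segment() { ... }` loop, with each `seg` consumed (or dropped) before the next call. The price of this design is that segments cannot be collected into a Vec of iterators or held across iterations.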

@HadrienG2

HadrienG2 commented Jan 12, 2019

I looked into it further and tried to adapt parts of the GraphemeCursor implementation to streaming use cases. From this experiment, it seems to me that it is impossible to provide both of the following API properties at the same time while keeping the implementation sane:

  1. Ability to work with an incomplete view of the input string and add more of it as needed.
  2. Ability to work with streams of char.

The reason is that in the current API, extra input is "patched together" with the existing input using UTF-8 indices as a unifying abstraction, and AFAIK there is no nice equivalent in an Iterator<Item=char> world. An incomplete replacement would be some ability to attach an extra iterator at the end of the existing one using Iterator::chain(), but that would not address the full generality of the current API, which can also work with overlapping chunks of UTF-8 (though whether one would ever want to use them is up for debate).

So unless I'm missing something obvious, it seems to me that the least bad option is to stick with the existing code, collect the iterator of chars into a (possibly truncated) UTF-8 string and use unicode_segmentation on that.
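
For illustration, the buffering workaround could look roughly like the sketch below. `buffer_chars` is a hypothetical helper, and `split_segments` is a std-only stand-in so the example compiles without dependencies; with the unicode-segmentation crate in scope, one would instead call `split_word_bounds()` or `graphemes(true)` on the buffered string:

```rust
/// Buffers at most `max_chars` chars from a stream into an owned String.
/// Hypothetical helper; the cap is what makes the result "possibly truncated".
pub fn buffer_chars<I>(chars: I, max_chars: usize) -> String
where
    I: Iterator<Item = char>,
{
    chars.take(max_chars).collect()
}

/// Stand-in segmenter (NOT UAX #29): splits where whitespace-ness changes.
/// In real code this is where the crate's `&str`-based API would be used.
pub fn split_segments(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<char> = None;
    for (i, c) in text.char_indices() {
        if let Some(p) = prev {
            if p.is_whitespace() != c.is_whitespace() {
                // Byte index `i` is a char boundary, so slicing is safe.
                out.push(&text[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}
```

The segments borrow from the buffer, so the buffer has to outlive them. If the buffer was truncated, its last segment may be incomplete and should be treated with care.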
