Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add read_str method to JsonReader's StringValueReader #11

Open
Marcono1234 opened this issue Sep 9, 2023 · 0 comments
Open

Add read_str method to JsonReader's StringValueReader #11

Marcono1234 opened this issue Sep 9, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@Marcono1234
Copy link
Owner

Marcono1234 commented Sep 9, 2023

Problem solved by the enhancement

The Read returned by JsonReader::next_string_reader will always provide valid UTF-8 bytes (assuming there is no bug in a JsonReader implementation). It might therefore be convenient if it was possible to directly obtain str pieces from it. Given that the data should be valid UTF-8, this might also allow unchecked UTF-8 conversion to further improve performance.

Enhancement description

📢 This is only a proposal and not a planned feature yet. If you are interested in this, add a 👍 or feel free to comment (also with any suggestions and feedback).


Add a new StringValueReader trait (similarly named as the existing StringValueWriter trait), and let JsonReader::next_string_reader return that.

Add a read_str method to the trait which takes a &[u8] as input and returns a &str backed by it as output. Since StringValueReader should already provide valid UTF-8 data through its read method, this new read_str method could have the following default implementation:

// TODO doc: Returned str might be shorter than `buf` if EOF has been reached (possibly even empty str),
// or if last UTF-8 char bytes might not have fit completely into `buf`
// TODO: This implementation should probably retry reading in case of `ErrorKind::Interrupted`,
// and document this behavior then, similar to `StringValueWriter::write_str`
fn read_str<'a>(&mut self, buf: &'a mut [u8]) -> Result<&'a str, IoError> {
    let buf_len = buf.len();
    if buf_len < utf8::MAX_BYTES_PER_CHAR {
        panic!("Buffer is too small");
    }

    // Reserve space at end of buf to make sure no incomplete UTF-8 data is read, and
    // all missing continuation bytes of last char (if any) still have enough space
    let fill_buf = &mut buf[..(buf_len - (utf8::MAX_BYTES_PER_CHAR - 1))];
    debug_assert!(!fill_buf.is_empty());

    // Read once to determine if previous `read` left incomplete UTF-8 data in buffer; in that case fail fast
    let mut pos = self.read(fill_buf)?;
    if pos == 0 {
        // Reached EOF, return empty string
        return Ok("");
    } else if utf8::is_continuation(fill_buf[0]) {
        return Err(IoError::new(
            ErrorKind::InvalidData,
            "previous read left incomplete UTF-8 data",
        ));
    }

    while pos < fill_buf.len() {
        let read = self.read(&mut fill_buf[pos..])?;
        if read == 0 {
            break;
        }
        pos += read;
    }

    // Find position where UTF-8 bytes of last char start
    let mut last_char_start_pos = pos;
    while utf8::is_continuation(buf[last_char_start_pos]) {
        // This should be overflow safe, because StringValueReader should always provide valid UTF-8 bytes,
        // and due to `is_continuation` check at beginning of method
        last_char_start_pos -= 1;
    }

    let last_char_start_byte = buf[last_char_start_pos];
    let mut missing_bytes = if utf8::is_1byte(last_char_start_byte) {
        0
    } else if utf8::is_2byte_start(last_char_start_byte) {
        1
    } else if utf8::is_3byte_start(last_char_start_byte) {
        2
    } else if utf8::is_4byte_start(last_char_start_byte) {
        3
    } else {
        // StringValueReader should always provide valid UTF-8 bytes
        panic!("Unexpected: Malformed UTF-8 bytes in buffer: {buf:02X?}");
    };
    // Subtract number of continuation bytes which are already in buffer
    missing_bytes = usize::checked_sub(missing_bytes, pos - last_char_start_pos)
        // StringValueReader should always provide valid UTF-8 bytes, but apparently it provided too many continuation bytes
        .expect("Unexpected: Malformed UTF-8 bytes in buffer");

    // If necessary, read remaining bytes of last char
    if missing_bytes > 0 {
        self.read_exact(&mut buf[pos..(pos + missing_bytes)])?;
        pos += missing_bytes;
    }

    // This should be safe because StringValueReader should only return valid UTF-8 bytes
    // TODO: Maybe this is too risky for custom JsonReader implementations, where it cannot be guaranteed
    // that they are bug free and only return valid UTF-8 data
    let string = utf8::to_str_unchecked(&buf[..pos]);
    Ok(string)
}

⚠️ This code has not been tested yet!

It might then also be possible to override Read::read_to_string to use read_str instead for the string pieces to avoid UTF-8 validation overhead; though this might not be worth it, especially since the default read_to_string implementation seems to be highly optimized already.

Alternatives / workarounds

  • Don't add any new method and let user handle the UTF-8 conversion
    • at the risk that users would have to re-implement this themselves and might have bugs in their implementation
    • requiring the users to perform unchecked UTF-8 conversion in their code, if desired
@Marcono1234 Marcono1234 added the enhancement New feature or request label Sep 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant