edit: datum struct string type added utf8 check #1488

Open, wants to merge 2 commits into base: main

dust1
Contributor

@dust1 dust1 commented Feb 26, 2024

Rationale

Close #1300

Detailed Changes

Check whether a string is valid UTF-8 when inserting data.

Test Plan

pass

@dust1 dust1 changed the title from "Datum struct string type added utf8 check" to "edit: ddtum struct string type added utf8 check" Feb 26, 2024
@dust1 dust1 changed the title from "edit: ddtum struct string type added utf8 check" to "edit: datum struct string type added utf8 check" Feb 26, 2024
@dust1
Contributor Author

dust1 commented Feb 27, 2024

I forgot, I'll try adding a few more unit tests later

@@ -765,6 +776,11 @@ impl Datum {
}
}

fn valid_is_utf8(s: &str) -> Result<()> {
Contributor

This check may be expensive for long strings; it would be better to add an option that decides whether the check runs.
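One way the suggested option could look is sketched below. `WriteOptions`, `validate_utf8`, and `check_string_bytes` are hypothetical names used only for illustration; none of them come from the PR or the project.

```rust
/// Hypothetical write-path options (an assumption, not part of the PR).
#[derive(Debug, Clone, Copy)]
pub struct WriteOptions {
    /// When false, the UTF-8 scan is skipped entirely.
    pub validate_utf8: bool,
}

/// Validates `bytes` as UTF-8 only when the option asks for it.
/// The scan is linear in the input length, which is why it is made optional.
pub fn check_string_bytes(
    bytes: &[u8],
    opts: WriteOptions,
) -> Result<(), std::str::Utf8Error> {
    if !opts.validate_utf8 {
        return Ok(());
    }
    std::str::from_utf8(bytes).map(|_| ())
}
```

With the flag off, invalid bytes pass through unchecked, so a caller that disables it takes over responsibility for the encoding.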

@@ -765,6 +776,11 @@ impl Datum {
}
}

fn valid_is_utf8(s: &str) -> Result<()> {
from_utf8(s.as_bytes()).context(InvalidStringEncoding { msg: s })?;
Contributor

IMO this function should return bool, not a result.

Contributor Author

ok
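A minimal sketch of the bool-returning shape requested in the review, using only the standard library. `is_valid_utf8` and `ensure_utf8` are illustrative names, and the `String` error stands in for whatever error type the project actually uses.

```rust
/// Hypothetical predicate: true when `bytes` form valid UTF-8.
pub fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// The caller, not the predicate, decides how a failure is reported.
/// `String` is a placeholder error type for this sketch.
pub fn ensure_utf8(bytes: &[u8]) -> Result<(), String> {
    if is_valid_utf8(bytes) {
        Ok(())
    } else {
        Err(format!("invalid UTF-8 in {} byte(s) of input", bytes.len()))
    }
}
```

Keeping the predicate free of error construction lets each call site choose its own error context.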

@dust1
Contributor Author

dust1 commented Feb 27, 2024

I checked the official Rust documentation. For the code paths in datum.rs that build Datum objects from a String, Rust already guarantees that a String is valid UTF-8, so I might need to make the change elsewhere. 😢
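The guarantee can be illustrated with the standard library alone. In the sketch below, `try_string` and `recheck` are illustrative names: the safe constructor `String::from_utf8` rejects invalid bytes up front, so re-validating a `&str` obtained in safe code cannot fail.

```rust
/// Safe construction: invalid bytes never become a `String`.
pub fn try_string(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Re-validating an existing `&str` has the same shape as the check in the diff.
/// Unless `unsafe` code broke the invariant earlier, it always returns true.
pub fn recheck(s: &str) -> bool {
    std::str::from_utf8(s.as_bytes()).is_ok()
}
```

This suggests that a check placed on a `&str` argument only adds cost; the place where validation can actually fail is wherever raw bytes are first turned into a string type.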

@jiacai2050
Contributor

> rust guarantees that String is a utf8 string, which I might need to modify elsewhere. 😢

Yes, I grepped the code and found several places that call from_bytes_unchecked(bytes: Bytes).

As for debugging this issue, you can construct a GBK string using the SDK and trace why no error is raised for it.
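For a local reproduction that does not need the SDK, GBK bytes can be hard-coded. The sketch below uses the GBK encoding of "中文" (0xD6D0 0xCEC4) and reports where UTF-8 validation first fails; `GBK_SAMPLE` and `first_invalid_offset` are illustrative names, not project APIs.

```rust
/// "中文" encoded as GBK, hard-coded so that no encoding crate is needed.
pub const GBK_SAMPLE: [u8; 4] = [0xD6, 0xD0, 0xCE, 0xC4];

/// Returns the byte offset of the first invalid UTF-8 sequence,
/// or `None` when the whole input is valid UTF-8.
pub fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

Feeding `GBK_SAMPLE` through each layer of the write path and calling `first_invalid_offset` at every step is one way to find the layer that accepts the bytes without complaint.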

@dust1
Contributor Author

dust1 commented Mar 4, 2024

> > rust guarantees that String is a utf8 string, which I might need to modify elsewhere. 😢
>
> Yes, I grep the code and find several place contains from_bytes_unchecked(bytes: Bytes).
>
> As for debugging this issue, you can construct a GBK string using SDK, and trace why there is no error for it.

Ok, I'll try

@dust1
Contributor Author

dust1 commented Mar 14, 2024

The from_bytes_unchecked function is only called when decoding. I think what I should be looking for is why non-UTF-8 bytes get saved when encoding. I'll look into it later 😵
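To make the checked and unchecked paths concrete, here is a sketch of a byte-backed string wrapper. It is a hypothetical stand-in: the type name, the use of `Vec<u8>` instead of `Bytes`, the `unsafe` marker, and `try_from_bytes` are all assumptions for illustration, and only the name `from_bytes_unchecked` appears in the thread.

```rust
/// Hypothetical wrapper around bytes that are meant to be UTF-8.
pub struct StringBytes(Vec<u8>);

impl StringBytes {
    /// Checked constructor: rejects input that is not valid UTF-8.
    pub fn try_from_bytes(bytes: Vec<u8>) -> Result<Self, std::str::Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(Self(bytes))
    }

    /// Unchecked constructor: performs no validation at all.
    ///
    /// # Safety
    /// The caller must guarantee that `bytes` is valid UTF-8.
    pub unsafe fn from_bytes_unchecked(bytes: Vec<u8>) -> Self {
        Self(bytes)
    }

    /// Reports whether the wrapped bytes really are valid UTF-8.
    pub fn is_valid(&self) -> bool {
        std::str::from_utf8(&self.0).is_ok()
    }
}
```

If the write path has any entry point that wraps bytes without going through a checked constructor, invalid data can be stored and only surface later, for example when another component validates it on read.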

Development

Successfully merging this pull request may close these issues.

Parquet argument error: Parquet error: encountered non UTF-8 data
2 participants