Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UTF-8 validity checking to schema #151

Open
KBorders01 opened this issue Sep 8, 2021 · 0 comments
Open

Add UTF-8 validity checking to schema #151

KBorders01 opened this issue Sep 8, 2021 · 0 comments

Comments

@KBorders01
Copy link

For data-type "string", the _transform function just attempts to do str(data) and catches an exception to determine if the string is valid. Binary strings with null bytes or other invalid UTF-8 character sequences will pass through this function as valid strings. However, targets may expect strings to be valid encoded text, such as UTF-8.

UTF-8 encoding validation can be enforced with a pre_hook when calling transform, but this doesn't inform the target about the type of string. It'd be helpful to somehow include character encoding as part of the schema so that downstream targets can know what to expect and choose the appropriate data type. For example, MySQL has TEXT and BLOB types to separately handle text and binary strings. One natural place to put this could be the "format" parameter, though it'd be tedious to have to explicitly specify UTF-8 for every string when that is the default. It'd be convenient to have a way to make UTF-8 the default for all strings in a schema and override it with binary (the current behavior) explicitly for binary fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant