Add UTF-8 validity checking to schema #151

KBorders01 · 2021-09-08T13:42:27Z

For the "string" data type, the _transform function simply attempts str(data) and treats the value as valid unless an exception is raised. Binary strings containing null bytes or invalid UTF-8 byte sequences therefore pass through this function as valid strings. However, targets may expect strings to be validly encoded text, such as UTF-8.
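To illustrate the gap, here is a minimal sketch in plain Python. `naive_is_string` and `is_valid_utf8` are hypothetical helpers written for this issue, not functions from the library; the first only mimics the str(data) check described above.

```python
def naive_is_string(data):
    """Mimic the check described above: anything str() accepts counts as valid."""
    try:
        str(data)
        return True
    except Exception:
        return False


def is_valid_utf8(data):
    """Strict check: bytes must decode as UTF-8, str must be encodable as UTF-8."""
    try:
        if isinstance(data, (bytes, bytearray)):
            bytes(data).decode("utf-8", errors="strict")
        elif isinstance(data, str):
            # Fails for lone surrogates, which cannot be encoded as UTF-8.
            data.encode("utf-8", errors="strict")
        else:
            return False
        return True
    except UnicodeError:
        return False


# Invalid bytes, valid UTF-8 bytes, a lone surrogate, and bytes with a NUL.
samples = [b"\xff\xfe", b"caf\xc3\xa9", "\ud800", b"a\x00b"]
for value in samples:
    print(repr(value), naive_is_string(value), is_valid_utf8(value))
```

Note that a NUL byte is itself valid UTF-8, so a target that rejects NUL in text columns would need a separate check in addition to the decode.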

UTF-8 validation can be enforced with a pre_hook when calling transform, but this doesn't tell the target what kind of string to expect. It would be helpful to include the character encoding in the schema so that downstream targets know what to expect and can choose the appropriate data type. For example, MySQL has TEXT and BLOB types to handle text and binary strings separately. One natural place for this is the "format" parameter, though it would be tedious to specify UTF-8 explicitly for every string when that is the usual case. It would be convenient to make UTF-8 the default for all strings in a schema and to override it with binary (the current behavior) explicitly for binary fields.
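A rough sketch of what this proposal might look like, under stated assumptions: the hook is assumed to be called as `pre_hook(data, typ, schema)`, the `"binary"` and `"utf-8"` values for "format" are hypothetical and do not exist today, and `declared_encoding`, `make_utf8_pre_hook`, and `mysql_column_type` are illustrative names only.

```python
KNOWN_ENCODINGS = ("utf-8", "binary")  # hypothetical "format" values


def declared_encoding(field_schema, default="utf-8"):
    """Return the encoding a string field declares, else the schema-wide default."""
    fmt = field_schema.get("format")
    return fmt if fmt in KNOWN_ENCODINGS else default


def make_utf8_pre_hook(default="utf-8"):
    """Build a hook that validates strings unless the field is marked binary."""
    def pre_hook(data, typ, schema):
        if typ != "string" or data is None:
            return data
        if declared_encoding(schema, default) == "binary":
            return data  # current behavior: pass through unchecked
        if isinstance(data, (bytes, bytearray)):
            return bytes(data).decode("utf-8")  # raises UnicodeDecodeError if invalid
        if isinstance(data, str):
            data.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
        return data
    return pre_hook


def mysql_column_type(field_schema, default="utf-8"):
    """Target side: choose a column type from the declared encoding."""
    return "BLOB" if declared_encoding(field_schema, default) == "binary" else "TEXT"


hook = make_utf8_pre_hook()
text_field = {"type": "string"}
blob_field = {"type": "string", "format": "binary"}
print(hook(b"caf\xc3\xa9", "string", text_field), mysql_column_type(text_field))
print(hook(b"\xff\x00", "string", blob_field), mysql_column_type(blob_field))
```

With UTF-8 as the default, only binary fields need an explicit "format", and a target can map the same declaration to a column type without inspecting the data.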
