Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 encoding woes #64

Open
martint17r opened this issue Jan 23, 2018 · 4 comments
Open

UTF-8 encoding woes #64

martint17r opened this issue Jan 23, 2018 · 4 comments

Comments

@martint17r
Copy link

while using tap-pipedrive I noticed that the output produced - ultimately by format_message in messages.py is using simplejson with the default value of ensure_ascii=True - is encoded in Pythons escaped unicode (literal \u followed by 4 hexadecimal digits).

This confuses a lot of my later processing. I am not sure how to properly fix that later on.

I set PYTHONIOENCODING to utf-8 and it looks like the setting is working:

$ python -c'import sys; print(sys.stdout.encoding)'
utf8

The output from tap-pipedrive is unchanged, though.

A way to change the output encoding is to set ensure_ascii=False when calling simplejson.dumps. Would you accept a PR for that?

@KAllan357
Copy link
Contributor

KAllan357 commented Jan 23, 2018

Hi @martint17r - thanks for the issue.

I'm curious why this hasn't been a problem for any other consumers of this library. I am a little concerned about the downstream effects of such a change. I suspect there is no greater issue because our target-stitch library uses the corresponding parse_message function which is able to consume the ASCII output using json's loads.

I'm not sure if its relevant in this case, but in the past - when PYTHONIOENCODING was not enough - I've had to set the LANG environment variable described in this issue docker-library/python#13.

In any case, I'd be happy to merge a PR for this change if you make the default function work as it has worked but provide a flag to use your described behavior.

@martint17r
Copy link
Author

I believe the reason why this has not popped up anywhere else is that this is not a problem within python, i.e. the escaped unicode notation is a Python specific encoding, see 7.2 Python specific encodings, which it tolerates.

I ran into that problem, because I am trying to implement a target in Go.

According to RFC 8259 Ch. 8.1 JSON must be encoded in UTF-8, even RFC 7493 Ch. 2.1 agrees on that.

Would you accept a second pull request to PROPOSALS.md to enhance the SPEC accordingly, i.e. define a MUST-HAVE UTF-8 encoding for all data? Something along the lines of:
"A Tap outputs structured messages to stdout in JSON format, one message per line, UTF-8 encoded".

I agree that this mandates a major version change in semver notation.

@KAllan357
Copy link
Contributor

This makes sense and both parts sound like a good change. I've done some preliminary testing and Python seems happy to decode both the escaped and not-escaped versions of the string.

Making it an option seems unnecessary - if you could make a PR to add the flag and we can do a major bump to 6.0.0. Adding to the Singer Spec sounds good too. Thanks for this!

(note: this SO question helped me understand the issue and there seems to be less confusion in Python 3 than 2 - the dumps function no longer returns a string OR unicode object when this flag is present)

@ashutoshvarma
Copy link

Any update on this? For now I am using ensure_ascii=False locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants