Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitize json with control characters #473

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

trifle
Copy link

@trifle trifle commented Oct 15, 2022

Hi!

I recently saw that VK seems to be really bad at sanitizing some fields such as usernames.
Loading a bunch of members with vk_api yielded an ugly
JSONDecodeError: Invalid control character at: line 1 column 68248 (char 68247)

Inspecting this by hand, I found that the response contains this abomination:

In [32]: e[68200:68259]
Out[32]: ',"is_closed":false},{"id":27497241,"nickname":"\x01","domain":'

Of course, <0x01> is not a codepoint that you'd want in a nickname, ever!

Since the data is nonsensical, not printable and potentially dangerous, my suggestion would be to catch JSONDecodeErrors and try to sanitize the raw content before parsing it with json.loads.

There are many options for replacing the problematic characters, but regex should be reasonably fast. As a bonus, it allows us to catch control characters as a category (\p{C}) rather than listing them by hand.

@python273
Copy link
Owner

might be better to use strict=False instead of regexp?

https://docs.python.org/3/library/json.html#json.JSONDecoder

@trifle
Copy link
Author

trifle commented Oct 15, 2022

Hi @python273, thanks for taking the time to look into it!

I've actually thought about using that flag: As far as I read it, this simply allows the json decoder to pass on this invalid data.
But I don't think it's best to use that here, because the resulting data would almost certainly cause problems for people. I mean, is there any valid use case where you would want a null byte or similar in a VK field to process further?

I also feel that delimiters are another special case. As far as I see it, having "\r" or "\n" in json data (unescaped!) means that almost certainly something on the sending end is broken. Because all tooling that works with newlines will be broken, and you essentially can't rely on any of the data to remain sensible (i.e. where do you stop parsing? Just kill everything before the newline?).

A final consideration is the maintainability. From that perspective I could see why using the json decoder flag is more attractive.

What do you think?

@python273
Copy link
Owner

python273 commented Oct 15, 2022

I reported the bug to VK, hopefully they just fix their JSON encoder to escape control chars ('\x01' as '\\u0001')

Not sure about regexp, since it alters the content of the response, so let's see what VK replies

@trifle
Copy link
Author

trifle commented Oct 15, 2022

Thanks, that's a great idea.
I've encountered way too many double- or under-encoded data, one less is certainly a good idea.
I'll keep using my fork in the meantime, so feel free to close this PR.
BTW, there still is a typo, the import is named re not rx, I'll change that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants