Allow decoding of JSON messages in non-standard character sets #2093

codehead · 2023-08-13T05:19:24Z

Summary

This PR modifies Mojo::Message to allow JSON decoding when a charset is specified. Currently all messages are decoded as UTF-8, so JSON messages with other encodings may fail silently.

Motivation

Mojo::UserAgent always tries to decode messages as UTF-8, ignoring the message character set. When a message is not valid UTF-8, Mojo::JSON::json_decode() fails silently and Mojo::UserAgent::json() returns undef. In character sets with a large overlap with UTF-8, such as ISO-8859-1, message decoding fails only when accented characters are present, so JSON messages might seem empty at random.
This is critical to interface with legacy systems that expose JSON messages in charsets different from UTF-8.
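The "fails only when accented characters are present" behaviour can be illustrated with a minimal sketch. It uses only the core Encode module rather than Mojolicious itself, and the helper name is_valid_utf8 is hypothetical, introduced here purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 when $bytes form well-formed UTF-8, else 0.
sub is_valid_utf8 {
  my ($bytes) = @_;
  # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $bytes intact.
  my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
  return defined $chars ? 1 : 0;
}

# The same kind of JSON document, serialized as ISO-8859-1 bytes.
my $ascii_only = encode('ISO-8859-1', '{"name":"Jose"}');
my $accented   = encode('ISO-8859-1', qq({"name":"Jos\x{e9}"}));

# ASCII is a subset of both charsets, so only the accented body is rejected.
printf "ASCII-only body is valid UTF-8: %s\n", is_valid_utf8($ascii_only) ? 'yes' : 'no';
printf "Accented body is valid UTF-8:   %s\n", is_valid_utf8($accented)   ? 'yes' : 'no';
```

The single byte 0xE9 that ISO-8859-1 uses for "é" is the start of a multi-byte sequence in UTF-8, so a strict UTF-8 decoder rejects it when the next byte is not a continuation byte.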

References

No public issues or PRs for Mojolicious as far as I'm aware.
Other Perl frameworks allow non-UTF-8 charsets for JSON messages from very early on -- e.g. Catalyst::View::JSON.
Some Java frameworks defaulted to ISO-8859-1 encoding as recently as 2020: Content Type being append with charset=ISO-8859-1 #1428.
Even though RFC4627 tried to standardize UTF-8 encoding for JSON messages as far back as 2006 (at around the same time some frameworks implemented charset specification to allow non-ASCII characters in JSON), RFC8259's language leaves enough slack for systems that are part of a closed ecosystem.

kraih · 2023-08-13T12:00:59Z

lib/Mojo/Message.pm

@@ -132,7 +132,8 @@ sub is_limit_exceeded { !!shift->{limit} }
 sub json {
  my ($self, $pointer) = @_;
  return undef if $self->content->is_multipart;
-  my $data = $self->{json} //= j($self->body);
+  my $charset = $self->body_params->charset || $self->default_charset;


I wonder what the performance cost of this is.

codehead · 2023-08-13

You are right, $self->content->charset is the way to go.

kraih · 2023-08-13T12:02:07Z

lib/Mojo/Message.pm

@@ -132,7 +132,8 @@ sub is_limit_exceeded { !!shift->{limit} }
 sub json {
  my ($self, $pointer) = @_;
  return undef if $self->content->is_multipart;
-  my $data = $self->{json} //= j($self->body);
+  my $charset = $self->body_params->charset || $self->default_charset;
+  my $data    = $self->{json} //= j($charset eq 'UTF-8' ? $self->body : encode('UTF-8', decode($charset, $self->body)));


j(encode(decode())) seems super inefficient.

codehead · 2023-08-13 (edited)

Agreed. Tried this before:

my $data = $charset eq 'UTF-8' ? j($self->body) : from_json(decode($charset,$self->body));

which seems more natural but breaks the last test in t/mojo/response.t subtest "Parse response and extract JSON data":

t/mojo/response.t .......................... 1/?
# Failed test 'right result'
# at t/mojo/response.t line 1075.
# Structures begin differing at:
# $got->[1] = '2'
# $expected->[1] = '4'
# Looks like you failed 1 test of 9.
# Failed test 'Parse response and extract JSON data'
# at t/mojo/response.t line 1076.
# Looks like you failed 1 test of 56.

i.e. this one:

is_deeply $res->json('/baz'), [1, 2, 3], 'right result';
$res->json->{baz}[1] = 4;
is_deeply $res->json('/baz'), [1, 4, 3], 'right result'; # <-- This one here

j(encode(decode())) is awkward but avoids that subtle breakage.

I might be missing something, so all suggestions are welcome.
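One possible explanation, offered as a sketch and not as something stated in the thread: the failing test mutates the structure returned by json and expects the next call to see the change, which only works if the decoded data is cached. The snippet quoted above, taken exactly as written, no longer assigns to $self->{json}. The example below reproduces that effect with a hypothetical stand-in class (Local::Message) and the core JSON::PP module; it is not Mojo::Message itself.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical stand-in for a message object; not Mojo::Message itself.
{
  package Local::Message;

  sub new { my ($class, $body) = @_; return bless {body => $body}, $class }

  # Decodes on every call, like a json method without "$self->{json} //=".
  sub json_uncached { JSON::PP::decode_json($_[0]{body}) }

  # Decodes once and keeps the structure, like the existing json method.
  sub json_cached { $_[0]{json} //= JSON::PP::decode_json($_[0]{body}) }
}

my $msg = Local::Message->new('{"baz":[1,2,3]}');

# The change lands on a throwaway structure and is lost on the next call.
$msg->json_uncached->{baz}[1] = 4;
printf "uncached: %d\n", $msg->json_uncached->{baz}[1];

# The change lands on the cached structure and is visible afterwards.
$msg->json_cached->{baz}[1] = 4;
printf "cached:   %d\n", $msg->json_cached->{baz}[1];
```

If that reading is right, keeping the cache assignment around the whole conditional, for example $self->{json} //= $charset eq 'UTF-8' ? j($self->body) : from_json(decode($charset, $self->body)), would preserve the mutation the test relies on. Error handling would still differ, since from_json dies on invalid input while j returns undef.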

jberger · 2023-08-13T12:48:31Z

I'm out of town and reading on my phone, so I'm not going to do a full review, but I think I agree with @kraih: there are things in there that don't pass the smell test.

Grinnz · 2023-08-13T17:06:01Z

Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))
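A minimal sketch of what that two-function approach could look like in practice, assuming Mojolicious is installed. The response is built by hand for illustration, and the fallback to UTF-8 when no charset is declared is an assumption, not something the thread prescribes.

```perl
use strict;
use warnings;
use Mojo::JSON qw(from_json);
use Mojo::Message::Response;
use Mojo::Util qw(decode);

# Build a response whose JSON body is ISO-8859-1 encoded (byte 0xE9 is "é").
my $body = qq({"name":"Jos\xe9"});
my $res  = Mojo::Message::Response->new;
$res->parse("HTTP/1.1 200 OK\x0d\x0a"
    . "Content-Type: application/json; charset=ISO-8859-1\x0d\x0a"
    . 'Content-Length: ' . length($body) . "\x0d\x0a\x0d\x0a"
    . $body);

# Use the charset declared in Content-Type; the UTF-8 fallback is an assumption.
my $charset = $res->content->charset // 'UTF-8';

# Mojo::Util::decode returns undef when the bytes do not match the charset.
my $chars = decode($charset, $res->body);
die "Body is not valid $charset\n" unless defined $chars;

# from_json expects characters (not bytes) and dies on invalid JSON.
my $data = from_json($chars);
printf "name has %d characters\n", length $data->{name};
```

Unlike ->json, this fails loudly: an undecodable body or invalid JSON raises an exception instead of yielding undef.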

codehead · 2023-08-13T22:24:47Z

> Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than to have it fail silently.

Grinnz · 2023-08-13T22:29:36Z

> > Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

> Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

> The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than a silent fail.

There are plenty of issues one may run into in real-world use cases; the framework has to balance predicting these with making the common and compliant case simple and efficient. In this case, calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive JSON.

jberger · 2023-08-16T15:20:59Z

> > > Use of ->json is optional. In my opinion if you have a legacy system with nonstandard JSON encoding, you should just use from_json(decode($charset, $msg->body))

> > Thank you @Grinnz. With all due respect, use of ->text is also optional, and it does take into account the charset specified in the message and falls back to a reasonable value should decoding fail. And that syntactic sugar is very, very welcome. I hope the same can be achieved with ->json in the name of cross-framework, cross-language, cross-charset interoperability.

> I don't see the comparison. The only purpose of ->text is to decode from the specified charset. The only purpose of ->json is to decode from spec-compliant JSON.

> > The thing is, you don't know you are dealing with this issue (I still refuse to call it a bug) until it bites you. On a random message. That might be irreproducible. From a system you don't control. IMHO it would be more useful to have json() die loudly upon trying to decode a non-UTF-8 message than a silent fail.

> There are plenty of issues one may run into in real-world use cases; the framework has to balance predicting these with making the common and compliant case simple and efficient. In this case, calling two functions seems like a simple option compared to complicating and slowing down the ->json method for all users in the common case where you receive JSON.

I concur with @Grinnz. There is a difference in that "text" is specified to be in a known (configured) charset, but JSON is by definition UTF-8 encoded. As such, bytes that are JSON-like but encoded in some other charset are technically not JSON. I know that's not a very satisfactory answer, and it sounds dismissive, but given that difference it seems not unreasonable to make the spec-compliant mechanism as simple and fast as possible while still making your use case possible, if not quite as convenient. It seems that that is already the case.

kraih · 2023-08-26T11:33:26Z

I'm afraid it looks like this PR will not pass the vote.

Decode JSON messages in non-standard character sets instead of failing silently. (6b926df)

kraih reviewed Aug 13, 2023

kraih requested review from a team, marcusramberg, kraih, jberger and Grinnz August 13, 2023 12:02
