ParsingException on unicode U+FFFF character #254

Open
anilkumarmyla opened this issue Mar 16, 2018 · 5 comments

@anilkumarmyla

Self-explanatory with the following code:

Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
Type in expressions for evaluation. Or try :help.

scala> import spray.json._
import spray.json._

scala> val a: String = """{"hello":"a\uFFFFworld"}"""
a: String = {"hello":"a�world"}

scala> a.parseJson
spray.json.JsonParser$ParsingException: Unexpected end-of-input at input index 11 (line 1, position 12), expected '"':
{"hello":"a
           ^

  at spray.json.JsonParser.fail(JsonParser.scala:217)
  at spray.json.JsonParser.require(JsonParser.scala:200)
  at spray.json.JsonParser.string(JsonParser.scala:148)
  at spray.json.JsonParser.value(JsonParser.scala:67)
  at spray.json.JsonParser.members$1(JsonParser.scala:85)
  at spray.json.JsonParser.object(JsonParser.scala:90)
  at spray.json.JsonParser.value(JsonParser.scala:64)
  at spray.json.JsonParser.parseJsValue(JsonParser.scala:46)
  at spray.json.JsonParser.parseJsValue(JsonParser.scala:42)
  at spray.json.JsonParser$.apply(JsonParser.scala:28)
  at spray.json.RichString.parseJson(package.scala:50)
  ... 36 elided

scala> a.getBytes
res1: Array[Byte] = Array(123, 34, 104, 101, 108, 108, 111, 34, 58, 34, 97, -17, -65, -65, 119, 111, 114, 108, 100, 34, 125)

scala> a.replaceAll("\uFFFF", "").parseJson
res2: spray.json.JsValue = {"hello":"aworld"}

scala> a.replaceAll("\uFFFF", "").getBytes
res3: Array[Byte] = Array(123, 34, 104, 101, 108, 108, 111, 34, 58, 34, 97, 119, 111, 114, 108, 100, 34, 125)

scala> 
@ramanmishra

Can you please share your build.sbt or your spray-json library version?

@anilkumarmyla
Author

> Can you please share your build.sbt or your spray-json library version?

It happens with the latest version, 1.3.4.

@jrudolph
Member

Hi @anilkumarmyla, that's by design (but could be documented better). According to the Unicode standard, U+FFFF (\uffff) is a noncharacter that is reserved for "process-internal" use. That's exactly how it is used inside spray-json: it designates the end of input.
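
To illustrate why an in-band U+FFFF produces the "Unexpected end-of-input" message from the report, here is a minimal sketch of a cursor-based scanner that uses the same sentinel idea. It is not spray-json's actual code: `SentinelSketch`, `charAt`, and `readStringBody` are hypothetical names, and the only assumption carried over from the thread is that U+FFFF marks the end of input.

```scala
// Hypothetical miniature of a sentinel-based scanner; NOT spray-json's real implementation.
object SentinelSketch {
  // Assumption taken from the thread: U+FFFF signals "no more input".
  final val EOI = '\uFFFF'

  // Returns the char at `index`, or the sentinel once the input is exhausted.
  def charAt(input: String, index: Int): Char =
    if (index < input.length) input.charAt(index) else EOI

  // Reads a string body starting just after an opening quote. Returns the
  // content, or an error naming the index where scanning stopped.
  def readStringBody(input: String, start: Int): Either[String, String] = {
    @annotation.tailrec
    def loop(i: Int, acc: StringBuilder): Either[String, String] =
      charAt(input, i) match {
        case '"' => Right(acc.toString)
        // A real U+FFFF inside the text is indistinguishable from the sentinel.
        case EOI => Left(s"Unexpected end-of-input at input index $i")
        case c   => loop(i + 1, acc.append(c))
      }
    loop(start, new StringBuilder)
  }
}

object SentinelSketchDemo extends App {
  val doc = "{\"hello\":\"a\uFFFFworld\"}"
  // The string body starts at index 10, right after the opening quote at index 9.
  println(SentinelSketch.readStringBody(doc, 10))
  // prints: Left(Unexpected end-of-input at input index 11)
}
```

Because the scanner cannot tell a real U+FFFF from "input exhausted", it stops at index 11, which matches the index in the exception above.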

@plokhotnyuk

FYI:

"Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

For up-to-date Unicode implementations, however, one should use caution when choosing sentinel values. U+FFFF and U+10FFFF still have interesting numerical properties which render them likely choices for internal use as sentinels, but implementers should be aware of the fact that those values, as for all noncharacters in the standard, are also valid in Unicode strings, must be converted between UTFs, and may be encountered in Unicode data—not necessarily used with the same interpretation as for one's own sentinel use. Just be careful out there!"

http://www.unicode.org/faq/private_use.html#sentinel6

@jrudolph
Member

Thanks for the added information. There's also the paragraph directly before:

Unicode 4.0 also added an entire new informative section about noncharacters, which recommended the use of U+FFFF and U+10FFFF "for internal purposes as sentinels." That new text also stated that "[noncharacters] are forbidden for use in open interchange of Unicode text data," a claim which was stronger than the formal definition. And it made a contrast between noncharacters and "valid character value[s]", implying that noncharacters were not valid. Of course, noncharacters could not be interpreted in open interchange, but the text in this section had not really caught up with the implications of the change of wording in the conformance requirements for UTFs. The text still echoed the sense of "invalid" associated with noncharacters in Unicode 3.0.

So yes, it's complicated, but I also think it's arguably still a good enough solution right now. Let's reopen to add a note to the documentation that those code points are not supported by the parser.
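
Until such a note exists, callers who may receive U+FFFF in their data could check for it before parsing. The following is only a sketch under stated assumptions: `NoncharacterGuard`, `firstSentinelIndex`, and `checked` are hypothetical helpers, not part of spray-json, and the check covers only U+FFFF because that is the one sentinel named in this thread.

```scala
// Hypothetical pre-parse guard; not part of spray-json.
object NoncharacterGuard {
  // Assumption: only U+FFFF needs rejecting (the sentinel named in this thread).
  private final val Sentinel = '\uFFFF'

  // Index of the first U+FFFF in the input, if any.
  def firstSentinelIndex(input: String): Option[Int] = {
    val i = input.indexOf(Sentinel)
    if (i >= 0) Some(i) else None
  }

  // Right(input) when the input is safe to hand to the parser,
  // Left(message) when it contains the sentinel character.
  def checked(input: String): Either[String, String] =
    firstSentinelIndex(input) match {
      case Some(i) =>
        Left(s"Input contains U+FFFF at index $i, which the parser treats as end-of-input")
      case None =>
        Right(input)
    }
}
```

With spray-json on the classpath, this could be combined with the parser as `NoncharacterGuard.checked(a).map(_.parseJson)`, so the caller gets an explicit message instead of a misleading end-of-input error. Whether to reject such input or strip the character (as in the `replaceAll` call in the original report) depends on the application.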

@jrudolph jrudolph reopened this Jul 10, 2018