Support for unicode and octal escapes in string literals. #65

Open
wants to merge 1 commit into
base: master

@wagjo wagjo commented May 25, 2014

The specs do not mention whether unicode and octal escapes are supported. As clojure.edn supports them [1], I've added an explicit mention in the specs. I'm a registered Clojure contributor (signed CA).

[1] https://github.com/clojure/clojure/blob/c6756a8bab137128c8119add29a25b0a88509900/src/jvm/clojure/lang/EdnReader.java#L580

avodonosov commented Apr 22, 2020

@richhickey, the absence of unicode escapes in string literals is really limiting. And the reason for that is unclear, given that unicode escapes are supported for characters.

bpsm added a commit to bpsm/edn-java that referenced this pull request Apr 25, 2020
This is in response to edn-format/edn#65.

This is an extension, since string literals as currently documented
do not specify support for \uXXXX escapes.

  https://github.com/edn-format/edn/tree/a51127aecd318096667ae0dafa25353ecb07c9c3

Notes:

- A Unicode escape must begin with "\u". This is case sensitive: "\U" will
  be rejected.
- "\u" must be followed by exactly four hex digits taken from this set:
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
- The digits are not case sensitive.
- Each such Unicode escape encodes a single 16-bit Java char. Since Java
  uses UTF-16 internally (for historical reasons), code points beyond
  the Basic Multilingual Plane are written as a pair of Unicode escapes
  (see also "surrogate pairs").
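
The rules in these notes can be sketched in a few lines. The following is a minimal illustration in Python, not edn-java's actual code: the function name `decode_unicode_escapes` and the choice to copy every other backslash escape through unchanged are assumptions made for the example. It collects UTF-16 code units and decodes them at the end, which is one way to show how a pair of escapes becomes a single code point beyond the Basic Multilingual Plane.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def decode_unicode_escapes(body: str) -> str:
    """Decode \\uXXXX escapes in the body of a string literal.

    Hypothetical helper for illustration only. Any other backslash
    escape is copied through unchanged.
    """
    units = bytearray()  # UTF-16 little-endian code units
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 >= len(body):
            # Ordinary character (or a trailing backslash): keep it.
            units += ch.encode("utf-16-le", "surrogatepass")
            i += 1
            continue
        marker = body[i + 1]
        if marker == "U":
            raise ValueError("escape must start with lowercase \\u")
        if marker != "u":
            # Some other escape such as \n or \\: leave both characters as-is.
            units += body[i:i + 2].encode("utf-16-le", "surrogatepass")
            i += 2
            continue
        digits = body[i + 2:i + 6]
        if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
            raise ValueError("\\u must be followed by exactly four hex digits")
        # Each escape contributes exactly one 16-bit code unit.
        units += int(digits, 16).to_bytes(2, "little")
        i += 6
    # Decoding as UTF-16 joins each surrogate pair into one code point.
    return units.decode("utf-16-le", "surrogatepass")
```

With this sketch, `decode_unicode_escapes(r"\ud83d\ude00")` yields the single code point U+1F600, while `r"\U0041"` and `r"\u12"` both raise `ValueError`.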
bpsm added a commit to bpsm/edn-java that referenced this pull request May 1, 2020
Disabling:

By default \uXXXX escapes are now supported in String literals.

Parser.Config (and Parser.Config.Builder) now support a flag which can
be set to false to disable support for \uXXXX in string literals. This
restores the old behavior of throwing an EdnSyntaxException when such
escapes are encountered.

avodonosov commented

The maintainer of the edn-java library kindly agreed to implement unicode escapes in the library. Initially, this was planned as an option, disabled by default. After it was implemented that way, it turned out that https://github.com/clojure/tools.reader supports unicode escapes by default, so edn-java ultimately shipped with unicode escapes enabled by default.

It turns out that https://github.com/clojure/tools.reader also supports octal escapes in string and character literals, the same as the Clojure language. (The current edn spec includes unicode escapes for characters but omits octal escapes.)

@richhickey IMHO clarity is needed in the spec. It's strange unicode escapes are not specified for strings while they are specified for characters. And what about octal escapes?

@wagjo, if your pull request includes octal escapes for string literals, it makes sense to include them for characters too (the Clojure language and tools.reader support them in the form \oNNN).

As for backwards compatibility, I would suggest including the escapes in the spec and adding a comment: "Unicode and octal escapes in string literals and octal escapes in character literals were only added to the spec in 2020. Some implementations supported them before that. For compatibility, consumers of EDN documents (including parsing libraries) should always support the escapes. Suppliers of EDN documents should avoid the escapes unless they have verified that all consumers of their documents support them."

avodonosov commented May 3, 2020

BTW, in Java, octal escapes in string literals can contain up to 3 digits (https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), while the Clojure reader and clojure.tools.reader.edn require exactly 3 digits after the backslash.

So @wagjo, the wording "as in Java" in the pull request does not match precisely the current implementations.
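
To make the mismatch concrete, here is a small Python sketch that contrasts the two rules. It is an illustration only: the Java pattern follows the octal escape grammar in the linked JLS chapter, while the exactly-three-digit pattern simply encodes the behavior as described in this comment and has not been checked against the readers' source. The names `is_java_octal` and `is_exact_three_octal` are made up for the example.

```python
import re

# Java string literals: one to three octal digits, where a three-digit
# escape may only start with 0-3.
JAVA_OCTAL = re.compile(r"\\(?:[0-3][0-7]{2}|[0-7]{1,2})")

# Exactly three octal digits, per the behavior described in this thread.
# Range limits (such as a maximum of \377) are not modeled here.
EXACT_THREE_OCTAL = re.compile(r"\\[0-7]{3}")


def is_java_octal(escape: str) -> bool:
    """True if the whole token is a valid Java octal escape."""
    return JAVA_OCTAL.fullmatch(escape) is not None


def is_exact_three_octal(escape: str) -> bool:
    """True if the whole token is a backslash plus exactly three octal digits."""
    return EXACT_THREE_OCTAL.fullmatch(escape) is not None
```

Under these assumptions, `\7` and `\47` are accepted by the Java rule but not by the exactly-three-digit rule, while `\101` is accepted by both.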
