Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible parsing ambiguity: attribute key starting with punctuation #242

Open
faelys opened this issue Sep 11, 2023 · 7 comments
Open

Possible parsing ambiguity: attribute key starting with punctuation #242

faelys opened this issue Sep 11, 2023 · 7 comments

Comments

@faelys
Copy link

faelys commented Sep 11, 2023

Hello,

I'm still discovering the syntax and trying to understand the existing parser, please let me know if I missed something.

As far as I understand, attribute keys and bare value allow _, :, and - anywhere, including in the first character of a key and the last character of the value.

Therefore:

  • word{_key=value_} would mean <span _key="_value">word</span>,
  • word{_keyvalue_} would mean word<em>keyvalue</em>,
  • getting word<em>key=value</em> would require something like word{_key\=value_}?

I find it not very satisfying, that = having a potentially very long range, and the overall (admittedly contrieved) construct being hard to disambiguate with the brain.

The samething happens with - instead of _, replacing em with del.

Wouldn't it be simpler for both humans and parsers to forbid punctuation at the beginning of an attribute key? (Or would it break too much existing text?)

@jgm
Copy link
Owner

jgm commented Sep 11, 2023

This is the grammar in the comments in attributes.ts:

 * syntax:
 *
 * attributes <- '{' whitespace* attribute (whitespace attribute)* whitespace* '}'
 * attribute <- identifier | class | keyval
 * identifier <- '#' name
 * class <- '.' name
 * name <- (nonspace, nonpunctuation other than ':', '_', '-')+
 * keyval <- key '=' val
 * key <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 * val <- bareval | quotedval
 * bareval <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 * quotedval <- '"' ([^"] | '\"') '"'

@jgm
Copy link
Owner

jgm commented Sep 11, 2023

I see the issue with allowing _ and - at the beginning of a key name, given the syntactic roles of {_ and {-. I'm open to tightening up the syntax here.
@matklad any thoughts?

@matklad
Copy link
Contributor

matklad commented Sep 25, 2023

Thoughts:

  • word{_key is a garden path syntax which might change the meaning depending on what follows, and that's bad for humans.
  • I think this is fairly likely to be hit in practice. While the example where key starts with _ and value ends with _ is contrived, only the first condition is enough. In particular, the following two examples parse differently: span{_key=val} vs span{ _key=val}.
  • Above convinces me that this is definitely a bug, and we should change something. If we allow _ in keys, we should make sure it actually works!
  • I like that we allow _, :, - in bare keys, attributes often have a hierarchical structure to them, and these chars separate parts.
  • If anything, I would love to add . as a symbol supported in keys. Eg, TOML allows
    fruit.name = "banana"
    Allowing . would conflict with the class syntax though
  • Leading _ and - also seem useful, they are sometimes used to namesapce "private" attributes

I am torn about what's the best solution here. Given that we already assign special meaning to .ident and #ident, it seems safest to require keys and values to start with ASCII_ALPHANUM (and then also allow . in the middle)

@matklad
Copy link
Contributor

matklad commented Sep 25, 2023

Actually, did something change? I can no longer reproduce the original example on the playground.

Here's what I get

word{_key=value_}
<p>word<em>key=value</em></p>

word{_key=value}
<p>word{_key=value}</p>

So that it seems that we just never parse {_ as attribute, and { _ is required for disambiguating.

@jgm
Copy link
Owner

jgm commented Sep 25, 2023

I never did check the actual behavior. Nonetheless, this is a parsing ambiguity. We should at the very least document that the emphasis interpretation takes precedence, and maybe go further and disallow _ at the beginning of keys.

As for . inside keys, I'm open to that.

@bpj
Copy link

bpj commented Sep 25, 2023

I think it's safer to disallow underscores at the start of keys, as it removes any ambiguity. I have a feeling that this will be a recurrent issue otherwise.

Does HTML allow underscores at the start of attribute names? Not that djot should be as bound to HTML as Pandoc's element types still largely are.

@andersk
Copy link

andersk commented Dec 18, 2023

It does; HTML allows attributes to consist of all characters other than controls, , ", ', >, /, =, and noncharacters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants