Allow any valid HTML4 identifier string to be a djot identifier string #303

Open
bmschmidt opened this issue May 4, 2024 · 2 comments

@bmschmidt

bmschmidt commented May 4, 2024

Thanks for your work; I am excited about using this project.

I'm converting some markdown files to djot with pandoc and am hitting an unfortunate behavior. I'm uncertain if it's a bug in pandoc's djot writer, a needed change in djot, or neither; would be willing to contribute in either codebase if it is one.

echo "# R."  | pandoc -f markdown -t djot

produces


{#r.}
# R.

which parses to

doc
  para
    str text="{#r.}"
    soft_break
    str text="# R."

The same text without the period at the end compiles to the desired

doc
  section id="r"
    heading level=1
      str text="R"

Of course djot can set whatever rules it wants on what belongs in an ID, which implies the pandoc writer should not be writing a djot-invalid identifier; but unless I'm missing something the simpler solution would seem to be allowing any valid SGML and HTML4 identifier to be a valid djot identifier, where "ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")."
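
For illustration, the HTML4 rule quoted above can be written as a single regular expression. This is only a sketch of that one sentence, not code from pandoc or djot; the names `HTML4_ID` and `is_html4_id` are made up for this example.

```python
import re

# Hypothetical helper: a letter, then any number of letters, digits,
# hyphens, underscores, colons, and periods (the HTML4 ID/NAME rule quoted above).
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")


def is_html4_id(s: str) -> bool:
    """Return True if s is a valid ID token under the quoted HTML4 rule."""
    return HTML4_ID.fullmatch(s) is not None


print(is_html4_id("r."))       # True: the identifier pandoc emits for "# R."
print(is_html4_id("foo.bar"))  # True
print(is_html4_id("1"))        # False: must begin with a letter
```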

@jgm
Owner

jgm commented May 5, 2024

Currently the syntax for attributes (undocumented except in code comments) is

 attributes <- '{' whitespace* attribute (whitespace attribute)* whitespace* '}'
 attribute <- identifier | class | keyval
 identifier <- '#' name
 class <- '.' name
 name <- (nonspace, nonpunctuation other than ':', '_', '-')+
 keyval <- key '=' val
 key <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 val <- bareval | quotedval
 bareval <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 quotedval <- '"' ([^"] | '\"') '"'

So we don't allow . in an identifier. I can't recall whether there was a specific reason for this.
XML identifiers are more restrictive than this (must start with letter or underscore). HTML4 identifiers are less restrictive, and HTML5 identifiers are much less restrictive.
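
As a rough illustration of the `name` rule above, here is a minimal Python sketch. It is not djot's parser: the function names are invented for this example, and "punctuation" is assumed to mean ASCII punctuation plus any character in a Unicode `P*` category, which may not match what the implementations actually test.

```python
import string
import unicodedata

# Punctuation characters the `name` rule explicitly permits.
ALLOWED_PUNCT = {":", "_", "-"}


def is_name_char(ch: str) -> bool:
    """Nonspace, nonpunctuation other than ':', '_', '-'.

    Assumption: punctuation = ASCII punctuation or Unicode category P*.
    """
    if ch.isspace():
        return False
    if ch in ALLOWED_PUNCT:
        return True
    is_punct = ch in string.punctuation or unicodedata.category(ch).startswith("P")
    return not is_punct


def is_djot_name(s: str) -> bool:
    """name <- (name character)+"""
    return len(s) > 0 and all(is_name_char(ch) for ch in s)


print(is_djot_name("r"))   # True
print(is_djot_name("r."))  # False: '.' is punctuation, so {#r.} is not an attribute
print(is_djot_name("1"))   # True: no special rule for the first character
```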

Class names have more restrictions (at least if they're to be used with CSS).

EDIT: Anyway, I'm open to making this less restrictive, but some thought needs to go into what would be a reasonable restriction.

@bmschmidt
Author

bmschmidt commented May 5, 2024

At first glance, it seems that djot has a principle of not distinguishing between the first character and later characters in ids, possibly for simplicity of implementation? That would dictate that . can't appear in ids, because '.' name indicates a class. Or possibly it's just that classes and ids follow the same pattern, and class names in djot may not contain periods (which I agree is a good decision).

> HTML4 identifiers are less restrictive

As I understand it, HTML4 ids are generally extremely restrictive, because they follow the SGML rules laid out in ISO 8879:1986. #1 and #:, for example, are valid djot identifiers but invalid as HTML4 identifiers or class names, because they don't start with [A-Za-z].

The only case I see where djot is more restrictive than HTML4 is that "foo.bar" is a valid HTML4 identifier but an invalid djot identifier, because it contains a period. This difference prevents a lot of pretty basic ASCII-encoded HTML4 from being able to round-trip through djot back to HTML.

I have one firm proposal, which is to disentangle the identifier and class rules to allow non-initial identifier characters to be periods. I.e.:

 identifier <- '#' nameChar Maybe[subsequentIdChar+]
 class <- '.' nameChar+
 nameChar <- (nonspace, nonpunctuation other than ':', '_', '-')
 subsequentIdChar <- (nonspace, nonpunctuation other than ':', '_', '-', '.')
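
A minimal Python sketch of how the proposed identifier rule would behave, for comparison with the current one. The function names are invented for this example, and "punctuation" is assumed to mean ASCII punctuation plus Unicode `P*` categories, which may differ from what the djot implementations check.

```python
import string
import unicodedata


def is_name_char(ch: str) -> bool:
    """nameChar: nonspace, nonpunctuation other than ':', '_', '-'."""
    if ch.isspace():
        return False
    if ch in {":", "_", "-"}:
        return True
    # Assumption: ASCII punctuation or any Unicode P* category counts as punctuation.
    return not (ch in string.punctuation or unicodedata.category(ch).startswith("P"))


def is_subsequent_id_char(ch: str) -> bool:
    """subsequentIdChar: a nameChar, or a period."""
    return ch == "." or is_name_char(ch)


def is_proposed_identifier(s: str) -> bool:
    """The part after '#': one nameChar, then zero or more subsequentIdChars."""
    return (
        len(s) > 0
        and is_name_char(s[0])
        and all(is_subsequent_id_char(ch) for ch in s[1:])
    )


print(is_proposed_identifier("r."))       # True: {#r.} would now parse as an id
print(is_proposed_identifier("foo.bar"))  # True
print(is_proposed_identifier(".r"))       # False: a leading '.' still starts a class
```

Class names are unaffected in this sketch: they would still be checked with `is_name_char` alone, so they could not contain periods.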

My goals would be served equally well by the parser accepting periods in ids in any position but requiring them to be escaped (\.). But that feels uglier.

I don't have opinions about any larger related changes, though I do like how Unicode characters can be used in id and class names in djot.


Just for context, I should say that my interest here is not primarily in writing in djot, but in getting things into djot's AST, which is much nicer to work with than pandoc's for my purposes.
